  • Chris Cotsapas: I’d like to thank the organizers for the

  • opportunity to come talk to you guys about what we’ve been thinking about in my lab.

  • So, what I’m going to talk about primarily is stuff that’s going on. So, all of this

  • is unpublished. Feel free to think about it, share it, whatever. But it’s very much work

  • in progress. Some of it is hot off the press. So, do take it with a pinch of salt. So, what

  • we think about a lot is autoimmune diseases in my lab. And we kind of want to think about

  • which genes go wrong in disease, and we think about these regulatory genes. But actually

  • what we’re interested in are the causal genes. And my pointer doesn’t work. I can

  • use this pointer. It’s all coming up Chris today.

  • So, we’re thinking more about causality than anything else. So, when we say dysregulation,

  • we’re interested in pathogenesis, right? That’s ultimately what we’re after. And

  • so, just a 30,000 foot view of the immune system. If you remember, you start with a

  • stem cell. You have two major lineages in the immune system, the lymphoid and the

  • myeloid lineages. So, things like macrophages are all the way down here. And your T cells

  • and B cells are all the way down here. You can think of them as adaptive

  • versus innate. And what happens is every now and then, this goes wrong. So, the immune

  • system’s primary function is to protect the body from things that are foreign. And

  • so it’s got this amazing capacity to tell the difference between your cells and the

  • rest of the world. And it’s really good at this, but occasionally it screws up. And it kind

  • of -- what happens is that it starts attacking certain tissues.

  • So, if it doesn’t like myelin, you get multiple sclerosis. The immune system manages to go

  • into the brain and attack the myelin sheath [spelled phonetically] very specifically around

  • neurons, chew it up, and you get lesions in your brain. You can get things like skin attacks

  • which give you Sjogren’s syndrome, scleroderma, you can get type 1 diabetes, which we now

  • know is an immune disease. If it doesn’t like aspects of the GI tract, you wind up

  • with Crohn’s disease, ulcerative colitis, or celiac disease if it doesn’t like the

  • epithelia. Joint-specific dislikes, should we say, give you rheumatoid arthritis

  • or ankylosing spondylitis. And if it just doesn’t like DNA, if it doesn’t like nucleic

  • acid, it attacks everything, then you wind up with something called lupus, right? What’s

  • really interesting is that these are very, very, specific dislikes. So, MS is not rheumatoid

  • arthritis. It’s a very specific attack against myelin. It’s not a specific attack against

  • anything else.

  • And what we really want to understand is what these diseases are. So, something’s going

  • wrong with the immune system. We don’t really understand what it is. What we do know is

  • that all of these diseases are common. They’re complex genetic diseases. There’s a large

  • portion of heritability. They track in families. But they’re not Mendelian. It’s not one

  • catastrophic mutation, right? And, of course, as GWAS came along, I’m going to talk about

  • multiple sclerosis, which is something that I work on. But you can take this as read for

  • any immune disease. As GWAS came along, we hadn’t really gotten a lot of traction on

  • the genetics of these diseases. And then, sort of we barely managed to identify two

  • loci in the genome in one of the first GWAS studies. Then a little while later, we managed

  • to get another one. A meta-analysis of these two sets of studies from international consortia

  • kind of gave six new hits, and we’re starting to climb this power curve of discovery.

  • Then a further meta-analysis with more markers and a few more samples gave us an additional

  • three new hits. Even more samples gave us another 25 new hits. The immunochip gave us

  • 47. That took us up to 100. And our current studies, which are about 16,000 cases, 26,000

  • controls and replication in another 36,000 samples, we’ve got another 100-odd new hits.

  • So, we’re standing at around 200 loci right now in GWAS, right? That explains -- including

  • the HLA -- it explains about 55 percent of the heritability. We estimate that in the

  • common space there’s probably another 600 to 800 loci that we don’t know about yet.

  • We kind of do know about them. They’re not genome-wide significant yet. But we know they’re

  • there. And we know the approximate complexity of the disease is about 1,000 independent

  • variants.

  • And so, when ENCODE came along and we did -- we were a very small part of this paper

  • from John Stam sort of showing that in Crohn’s disease and in multiple sclerosis, there is

  • strong enrichment of the risk SNPs on regulatory regions active in very specific subsets of

  • the immune cells. And in multiple sclerosis in particular, you can see CD3 cells, CD19s,

  • B lymphocytes, and CD14s, which is interesting. There’s a lot of pathogenesis coming out

  • of T cells as well. But these are more B cell like. And so, dysregulation in multiple sets

  • of immune cells seems to be an issue here. But this kind of sends us chasing down this

  • idea that is now extremely common. And this is one of the great right, right? So, 10 years

  • ago GWAS wasn’t going to work. And five years ago, everyone was asking why we haven’t

  • solved disease yet. Five years ago, everything was coding. And now, everything is regulatory.

  • And it seems really obvious. But even two, three years ago, this was not that obvious.

  • And so, this chases us down -- starts us chasing down this rabbit hole of which genes are getting

  • dysregulated and how does that cause disease. And so, that’s what we are going to talk

  • about today -- further evidence that in specific immune cells, you get dysregulation that maps

  • into specific transcription factor binding sites, as in this work from Kyle Farh and Brad Bernstein

  • showing that the MS SNPs are particularly enriched for NF-kappa B transcription factor

  • ChIP-seq peaks for instance. And so, there’s something that’s fairly specific dysregulation

  • in immune cells, which is great in bulk, hard when you actually want to identify specific

  • effects on specific genes in specific cells. And so, that’s the task at hand. And so,

  • when you look at some of the loci, you know, you put up a GWAS locus. Here’s a classic

  • locus in MS. Well, there’s NF-kappa B one and mannose-binding protein A. And you could

  • sort of make a case for mannose-binding protein A, but really everyone’s going to assume

  • that NF-kappa B one is the appropriate gene. And it turns out that that’s right

  • for various reasons. And so you can start working on that because you kind of are reasonably

  • sure that’s the gene.

  • When you look at another locus of course, that gets a lot more difficult. You’ve got

  • this big association peak. There’s a bunch of genes in here, and the problem isn’t

  • that they’re not good candidates. There’s a bunch of good candidates in here. ORMDL3

  • is here. IKZF3, which is Helios, which is a transcription factor that controls T regulatory

  • cell differentiation is there. A bunch of other immune cells. And so, you’re kind

  • of going, “What’s going on here?” So, we kind of thought, “Okay. If there is regulation,

  • and we have SNPs, how do we unite the genetics with the epigenomics?” And a lot of people

  • are thinking about this. You’re going to hear a lot more stories about this. You’ve

  • already heard some. Here’s how we’ve been thinking about it.

  • So, we’re kind of amateur math geeks, and so we’re thinking about how we can transfer some

  • of this probability and do some functional fine mapping. So, you have a set of SNPs in

  • the genome. We’re going to talk about hypersensitive sites now. But instead of DHS, you can think

  • of any regulatory mark. We’ve been working a lot with hypersensitive sites because we

  • like them. They’re stable. They’re nice. They tell you a lot. We’re going to expand

  • this to the other sets. But think about DHS for now. And you’ve got a gene in the locus.

  • So, this is my like tiered view of a locus.

  • So, each of these guys is associated to disease. And -- oh, this is going to chop off my -- thanks.

  • Oh well. So, what that says is posterior probability of association or PPA, okay? So, when you

  • do a GWAS for each of these SNPs, you get a P value of whether it’s associated to

  • disease or not. You can convert that simple P value into basically a posterior probability

  • which tells you, what is the likelihood that this SNP is the one driving the signal, okay?

  • We’re not going to talk about the math magic that underlies that. I’ll bore you with

  • it in person over a coffee if you like. But basically, for each of these SNPs, you can

  • do a magical transformation and get the probability that that’s the SNP that’s driving signal.

  • If it’s very associated, and nothing else is associated, it’s going to be really probable

  • to drive the signal. If there’s a whole bunch of SNPs that are equally associated,

  • you’re going to have to spread the probability that it’s causal all over all of those guys,

  • right? That’s the intuition here. So, of course, some of these SNPs are actually on

  • DHSs. And so, you can transfer that probability. I can’t even talk anymore, sorry. That probability

  • to the DHS. You could also do something fancy like say this guy is about this far away

  • from this DHS, so I’m going to give it some proportion here. That’s -- we’re not doing

  • that right now. But basically, what I can do is come up with a way to score every regulatory

  • region for its probability of explaining the association in that region, right?

  • And if I sum every one of those -- of course not every SNP is on those -- but if I sum

  • all of these posteriors, that gives me the global probability that, in this locus, association

  • is mediated by these regulatory regions. Doesn’t have to be all of it. But if most of the signal

  • is on DHSs, then you’re going to get a high percentage, right? It’s going to be close

  • to one. If it doesn’t look like it’s being mediated by regulatory regions, you’re going

  • to get a low proportion.
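
The talk skips the math behind the posterior probability of association. As a rough sketch only, the Python below assumes each SNP already has a log Bayes factor and that the locus holds a single causal variant, so PPAs are the Bayes factors normalized across the locus. The SNP names, positions, and DHS intervals are invented for illustration; this is not necessarily how the lab computes it.

```python
import math

def posterior_probabilities(log_bayes_factors):
    """Turn per-SNP log Bayes factors into PPAs that sum to 1 over the locus."""
    top = max(log_bayes_factors.values())  # subtract the max for numerical stability
    weights = {snp: math.exp(lbf - top) for snp, lbf in log_bayes_factors.items()}
    total = sum(weights.values())
    return {snp: w / total for snp, w in weights.items()}

def dhs_scores(ppa, snp_positions, dhs_intervals):
    """Give each DHS the summed PPA of the SNPs that fall inside it."""
    scores = {name: 0.0 for name in dhs_intervals}
    for snp, pos in snp_positions.items():
        for name, (start, end) in dhs_intervals.items():
            if start <= pos < end:
                scores[name] += ppa[snp]
    return scores

# Hypothetical locus: two equally associated SNPs and two weak ones.
log_bf = {"snp1": 6.0, "snp2": 6.0, "snp3": 1.0, "snp4": 0.0}
positions = {"snp1": 1050, "snp2": 1900, "snp3": 2500, "snp4": 4000}
dhs = {"dhs_a": (1000, 1200), "dhs_b": (2400, 2600)}

ppa = posterior_probabilities(log_bf)
per_dhs = dhs_scores(ppa, positions, dhs)
# Summing is valid here because the example DHS intervals do not overlap.
regulatory_fraction = sum(per_dhs.values())
```

Because snp1 and snp2 are equally associated, they split the probability between them, which is the intuition described above. Only snp1 sits on a DHS, so about half of the locus signal lands on regulatory regions.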

  • So much is easy. What’s cool is you can think about how you correlate these guys

  • to the genes they control. So, if I had a magic way of saying, “Well, this DHS is

  • correlated to this gene this much,” then I can weight how much of the posterior of association

  • gets transferred into this gene, right? So, if this guy’s perfectly correlated -- if

  • this is what determines whether this gene is expressed -- then if this explains all

  • of the association to a trait, then presumably, it’s active on this gene. Because the DHS

  • isn’t just a DHS. It’s regulating something, right? So, that’s the intuition. And you

  • partition this all this way. And what it says here is CP times PPA, okay? So, that’s just

  • the correlation posterior between this DHS and this gene times how much weight you’ve

  • given it from the association data. And that way, you wind up building this model of this

  • gene posterior. So, if I sum all of these, all of the contribution of each DHS from the

  • SNPs going into this gene, I can get a sense of what the probability that this gene is

  • driving association in this region is. And I can do that for any gene.
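
The "CP times PPA" sum described above can be sketched in a few lines of Python. The DHS names, gene names, and numbers are made up; `correlation_posterior` stands in for whatever DHS-to-gene weights the real analysis produces.

```python
def gene_scores(dhs_ppa, correlation_posterior):
    """Sum CP x PPA over every DHS that is linked to each gene."""
    scores = {}
    for dhs, genes in correlation_posterior.items():
        for gene, cp in genes.items():
            scores[gene] = scores.get(gene, 0.0) + cp * dhs_ppa.get(dhs, 0.0)
    return scores

# Hypothetical inputs: PPA already transferred from SNPs onto two DHSs,
# and a correlation posterior (CP) linking each DHS to two genes.
dhs_ppa = {"dhs_a": 0.6, "dhs_b": 0.3}
correlation_posterior = {
    "dhs_a": {"GENE1": 0.9, "GENE2": 0.1},
    "dhs_b": {"GENE1": 0.2, "GENE2": 0.7},
}

scores = gene_scores(dhs_ppa, correlation_posterior)
# GENE1: 0.9 * 0.6 + 0.2 * 0.3 = 0.60
# GENE2: 0.1 * 0.6 + 0.7 * 0.3 = 0.27
```

In this toy locus GENE1 picks up most of the association because the DHS carrying most of the PPA is the one most strongly tied to it.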

  • So, I now derive a score basically for how likely this gene is to be pathogenic, if that

  • pathogenesis is mediated by DHS regions. And we know they’re enriched, so that’s a

  • reasonable hypothesis, okay? It’s not the only way to do it, but it’s one way to think

  • about this. And so, you have to solve a couple of technical problems to do this. One is,

  • you’ve got to correlate your DHSs to your genes. And so, that’s really simple. You

  • just observe if there’s a peak, and what the level of expression of a gene is, and

  • then you do a correlation, on-off versus level of expression of a gene. And you do that for

  • each DHS you find.
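
A minimal sketch of the on-off versus expression correlation described above, assuming a plain Pearson correlation between a 0/1 peak indicator and expression levels (equivalent to a point-biserial correlation). The sample values are invented, and the talk does not say which correlation statistic is actually used.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data for six samples: is the DHS peak present (1) or absent (0),
# and what is the expression level of a nearby gene in the same sample?
peak_present = [1, 1, 0, 0, 1, 0]
expression = [8.1, 7.9, 3.2, 2.8, 8.4, 3.0]

r = pearson(peak_present, expression)
```

Here the gene is high whenever the peak is present, so `r` comes out close to 1. In practice this would be repeated for every DHS-gene pair in a locus.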

  • Two issues. First of all, you’ve got to decide what the same DHS is. And secondly,

  • you need measurements where you’ve measured both DHS and gene expression, okay? So, to

  • do this thing, we use an alignment approach. This is what real DHS data looks like out

  • of Hotspot. These are peaks. This is an arbitrary part of the genome and your job is to figure out

  • which ones of these represent the same element across samples. We’re not terribly good

  • at that as human beings. Fortunately, computers are a lot better at this than we are.

  • So, you can put it in a clustering approach and kind of decide that these look the same;

  • they’re a little jittered, but they kind of look similar. And then these guys are kind

  • of the same, but you’re maybe a little less confident because there’s more spread.

  • And these guys are kind of the same as well, but there’s even more spread, okay? And

  • the way we do this is with Markov clustering. It’s a way to cluster stuff. There are other

  • ways to do it. It works reasonably well. And the way you think about this -- oh, and that’s

  • gotten chopped up as well. That’s brilliant. Okay. So, one way you might want to do this

  • is to say, is this detectable? And so, you go into the Roadmap data, and fortunately

  • there are replicates.
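
To make the peak-clustering step concrete, here is a toy Markov clustering (MCL) sketch in pure Python. Using Jaccard overlap between peak intervals as the similarity, an inflation of 2, and 50 iterations are all illustrative assumptions; the talk does not say what similarity or parameters are used on the real Hotspot peaks.

```python
def jaccard(a, b):
    """Overlap of two (start, end) intervals divided by the span of their union."""
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - overlap
    return overlap / union if union else 0.0

def normalize_columns(m):
    """Rescale every column so that it sums to 1."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]

def markov_cluster(similarity, inflation=2.0, iterations=50):
    """Alternate expansion (matrix squaring) and inflation (elementwise power)."""
    n = len(similarity)
    # Self-loops of weight 1 keep every column non-empty.
    m = [[similarity[i][j] if i != j else 1.0 for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        m = normalize_columns([[v ** inflation for v in row] for row in m])
    # Each surviving row lists the members of one cluster.
    clusters = {tuple(j for j in range(n) if m[i][j] > 1e-6) for i in range(n)}
    return sorted(c for c in clusters if c)

# Hypothetical peaks from different samples: three jittered copies of one
# element and two copies of another, far away.
peaks = [(100, 250), (110, 260), (95, 240), (1000, 1200), (1020, 1190)]
similarity = [[jaccard(a, b) for b in peaks] for a in peaks]
clusters = markov_cluster(similarity)
```

The result groups peak indices 0, 1, 2 into one element and 3, 4 into another. A dense matrix like this only works for toy inputs; genome-scale data would need a sparse implementation.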

  • And here’s my assertion. If I see a peak here in replicate one of a tissue, then I should

  • expect to see that peak in replicate two of a tissue as well, right? Biological replication,

  • just as we do in any other experiment. Really simple. And so once I decide this is my cluster,

  • that’s what comes out of the algorithm, you don’t just go and apply that mindlessly

  • to data. That’s not how you do analysis, right? You check and you see what you can

  • detect. And of course, the wider and the sloppier this peak is, the less likely it is to be

  • true. And so you can do a statistical test. And so, once you’ve decided what the cluster

  • is, if there’s a peak anywhere in that cluster, you mark that sample as a one. And if there’s

  • no peak, you mark it as a zero. If you have replicates -- the labels are somewhere over

  • here on that wall -- you can then say, “Okay, if I see ones in both replicates,

  • I’m going to score that tissue as a two. I’m going to score it as a one if there’s

  • only one replicate” -- so, if it’s discordant -- “and I’m going to score it as

  • a zero if there’s none there.” And then you can do a test.
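
The 0/1/2 scoring just described is simple enough to write down directly. In this sketch the cluster interval, tissue names, and peak coordinates are invented, and "a peak anywhere in that cluster" is taken to mean any overlap with the cluster interval.

```python
def has_peak(cluster, peaks):
    """Return 1 if any (start, end) peak overlaps the cluster interval, else 0."""
    c_start, c_end = cluster
    return int(any(start < c_end and end > c_start for start, end in peaks))

def score_tissue(cluster, replicate_peaks):
    """0 = absent in both replicates, 1 = discordant, 2 = present in both."""
    return sum(has_peak(cluster, peaks) for peaks in replicate_peaks)

# Hypothetical cluster and peak calls for two replicates of three tissues.
cluster = (1000, 1200)
tissues = {
    "tissue_a": [[(1010, 1150)], [(990, 1100)]],   # peak in both replicates
    "tissue_b": [[(1050, 1180)], [(5000, 5200)]],  # peak in one replicate only
    "tissue_c": [[], [(300, 450)]],                # no peak in the cluster
}

scores = {name: score_tissue(cluster, reps) for name, reps in tissues.items()}
```

This gives 2 for tissue_a, 1 for tissue_b, and 0 for tissue_c.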

  • So, I’ve done this without knowing about replicates. And then I add the information about

  • what goes with what and I ask, “Are they consistent?” So, if I get things like, “Look,

  • in cell type one, I get a one. And in two, I get a one. I get all ones.” That suggests

  • this isn’t consistent. It’s not replicating. And if I get a lot of twos, a lot of zeros,

  • and very few ones, that looks consistent. So, it’s replicating. It’s either not

  • there or it’s there. And so I can do a statistical test. It’s not terribly important what the

  • test is. It’s a simple chi-square approximation. We do this over 57 tissue replicates. So,

  • from Roadmap. And we find that just feeding this in when we cluster, we can get about

  • a million out of 1.99 million. So, about 54 percent of our clusters pass our fairly stringent

  • threshold -- a fairly lenient threshold. And that’s because very often these things are

  • kind of diffuse. The clusters don’t really look good. And so, we’re probably not doing

  • great at the clustering, and it’s unreliable, right? There’s also a bunch of singletons

  • in these data that get thrown out because they don’t replicate. But most of this is

  • actually the clustering.
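
The talk only says the consistency test is "a simple chi-square approximation". One plausible form, offered purely as an assumption, compares the observed counts of 0s, 1s, and 2s with what independent replicates would give at the same overall peak frequency. The example scores are invented.

```python
import math

def replication_test(scores):
    """Chi-square goodness-of-fit (1 df) of 0/1/2 tissue scores against the
    counts expected if the two replicates were independent."""
    n = len(scores)
    observed = [scores.count(k) for k in (0, 1, 2)]
    p = sum(scores) / (2 * n)  # overall frequency of a peak call
    if p in (0.0, 1.0):
        return 0.0, 1.0, True  # nothing to test: always absent or always present
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p ** 2]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2))  # chi-square survival function, 1 df
    consistent = observed[1] < expected[1]    # fewer discordant tissues than chance
    return stat, p_value, consistent

# Hypothetical cluster that replicates well: mostly twos and zeros, one discordant.
good = [2] * 12 + [0] * 15 + [1]
# Hypothetical cluster that never replicates: every tissue is discordant.
bad = [1] * 28

good_stat, good_p, good_consistent = replication_test(good)
bad_stat, bad_p, bad_consistent = replication_test(bad)
```

Both examples depart strongly from independence, so the direction flag matters: only the first has fewer discordant tissues than chance predicts, which is the "lots of twos, lots of zeros, very few ones" pattern.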

  • So, we can get about a million features across the genome. And we don’t worry about recovering

  • more stuff and improving the clustering. Right now, we’re just working with these million.

  • So, the other thing is, you’ve got relatively low power. And so, what’s nice about this

  • is this -- what you can clearly read here -- what you can do is estimate how much of the

  • heritability you’re still explaining. So, this is just a sanity check. If you use all

  • of these clusters, it’s about 14 percent of the genome, and it explains a proportion

  • of heritability. And what I want to know is if I reduce this to the half of the clusters

  • that I’m using now, what proportion of heritability am I still explaining? And to a first approximation,

  • what you can see here is in red is all the peaks and in blue is just the clusters that

  • we define. Pretty much we’re capturing all of the signal. It varies a bit; there’s wiggle room. There’s

  • a little bit of error on these things, but we’re capturing just about all of the heritability.

  • But we’ve gone from 14 percent of the genome to 8 percent of the genome.

  • So, rather than do the 500 base pair either side, which is what most previous heritability

  • estimates have done, which a lot of the summary papers have kind of shown, “Oh, there’s

  • enrichment in DHS or in regulatory regions or whatever.” But they actually bracket

  • each feature by 500 bases. And so, they cover 50 percent of the genome. So yes, all of the

  • heritability is explained by 50 percent of the genome. I’m telling you that a lot of

  • the heritability’s explained by eight percent of the genome. So, it’s a lot more specific.

  • And so, the second challenge is to now correlate these guys, now that we’ve decided what the

  • clusters are, to correlate them to gene expression. So, you need matched data. We use 22 sets

  • of matched DHS and exon array data from Roadmap again. And the problem is, there’s massive

  • inflation because gene expression data of course is highly correlated. And so you just

  • get this massive inflation in the expected distribution of these tests. And we can correct

  • this. We just go through and normalize it and basically, you kind of start off with

  • this massive inflation. I’m showing you lambda here. It’s supposed to be a nice

  • straight line here. And we can correct all of that out.
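
For readers unfamiliar with lambda: it is the genomic inflation factor, the median observed test statistic divided by the median expected under the null. The sketch below assumes 1-df chi-square statistics and a genomic-control style rescaling; the talk does not spell out the normalization actually applied, and the statistics are invented.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_factor(chi2_stats):
    """Lambda: observed median test statistic over the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by lambda (never inflating when lambda < 1)."""
    lam = max(inflation_factor(chi2_stats), 1.0)
    return [s / lam for s in chi2_stats]

# Hypothetical, heavily inflated DHS-gene correlation test statistics.
stats = [0.2, 0.9, 1.8, 3.0, 7.5]

lam_before = inflation_factor(stats)
corrected = genomic_control(stats)
lam_after = inflation_factor(corrected)
```

After rescaling, lambda returns to 1, which corresponds to the corrected points falling back onto the expected straight line.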

  • So, now that we have all of these statistics, we can go back and do our little approach.

  • So, now we have this part. We already have this part from credible interval, set mapping,

  • and posterior estimation. And we can now estimate gene-wide scores. And so, big red exclamation

  • point here you can see means this is really fresh, as in last Friday’s results. Hot

  • off the presses. Here is a region. It -- we’re talking about MS GWAS. This is actually the