Chris Cotsapas: I’d like to thank the organizers for the
-
opportunity to come talk to you guys about what we’ve been thinking about in my lab.
-
So, what I’m going to talk about primarily is stuff that’s going on. So, all of this
-
is unpublished. Feel free to think about it, share it, whatever. But it’s very much work
-
in progress. Some of it is hot off the press. So, do take it with a pinch of salt. So, what
-
we think about a lot is autoimmune diseases in my lab. And we kind of want to think about
-
which genes go wrong in disease, and we think about these regulatory genes. But actually
-
what we’re interested in are the causal genes. And my pointer doesn’t work. I can
-
use this pointer. It’s all coming up Chris today.
-
So, we’re thinking more about causality than anything else. So, when we say dysregulation,
-
we’re interested in pathogenesis, right? That’s ultimately what we’re after. And
-
so, just a 30,000 foot view of the immune system. If you remember, you start with a
-
stem cell. You have two major lineages in the immune system, the lymphoid and the
-
myeloid lineages. So, things like macrophages are all the way down here. And your T cells
-
and B cells are all the way down here. You can think of them as adaptive
-
versus innate. And what happens is every now and then, this goes wrong. So, the immune
-
system’s primary function is to protect the body from things that are foreign. And
-
so it’s got this amazing capacity to tell the difference between your cells and the
-
rest of the world. And it’s really good at this, but occasionally it screws up. And it kind
-
of -- what happens is that it starts attacking certain tissues.
-
So, if it doesn’t like myelin, you get multiple sclerosis. The immune system manages to go
-
into the brain and attack the myelin sheath very specifically around
-
neurons, chew it up, and you get lesions in your brain. You can get things like skin attacks
-
which give you Sjogren’s syndrome, scleroderma, you can get type 1 diabetes, which we now
-
know is an immune disease. If it doesn’t like aspects of the GI tract, you wind up
-
with Crohn’s disease, ulcerative colitis, or celiac disease if it doesn’t like the
-
epithelia. Joint-specific dislikes, should we say, give you rheumatoid arthritis
-
or ankylosing spondylitis. And if it just doesn’t like DNA, if it doesn’t like nucleic
-
acid, it attacks everything, then you wind up with something called lupus, right? What’s
-
really interesting is that these are very, very specific dislikes. So, MS is not rheumatoid
-
arthritis. It’s a very specific attack against myelin. It’s not a specific attack against
-
anything else.
-
And what we really want to understand is what these diseases are. So, something’s going
-
wrong with the immune system. We don’t really understand what it is. What we do know is
-
that all of these diseases are common. They’re complex genetic diseases. There’s a large
-
portion of heritability. They track in families. But they’re not Mendelian. It’s not one
-
catastrophic mutation, right? And, of course, as GWAS came along, I’m going to talk about
-
multiple sclerosis, which is something that I work on. But you can take this as read for
-
any immune disease. As GWAS came along, we hadn’t really gotten a lot of traction on
-
the genetics of these diseases. And then, sort of we barely managed to identify two
-
loci in the genome in one of the first GWAS studies. Then a little while later, we managed
-
to get another one. A meta-analysis of these two sets of studies from international consortia
-
kind of gave six new hits, and we’re starting to climb this power curve of discovery.
-
Then a further meta-analysis with more markers and a few more samples gave us an additional
-
three new hits. Even more samples gave us another 25 new hits. The immunochip gave us
-
47. That took us up to 100. And our current studies, which are about 16,000 cases, 26,000
-
controls and replication in another 36,000 samples, we’ve got another 100 odd new hits.
-
So, we’re standing at around 200 loci right now in GWAS, right? That explains -- including
-
the HLA -- it explains about 55 percent of the heritability. We estimate that in the
-
common space there’s probably another 600 to 800 loci that we don’t know about yet.
-
We kind of do know about them. They’re not genome-wide significant yet. But we know they’re
-
there. And we know the approximate complexity of the disease is about 1,000 independent
-
variants.
-
And so, when ENCODE came along and we did -- we were a very small part of this paper
-
from John Stam sort of showing that in Crohn’s disease and in multiple sclerosis, there is
-
strong enrichment of the risk SNPs on regulatory regions active in very specific subsets of
-
the immune cells. And in multiple sclerosis in particular, you can see CD3 cells, CD19s,
-
B lymphocytes, and CD14s, which is interesting. There’s a lot of pathogenesis coming out
-
of T cells as well. But these are more B cell like. And so, dysregulation in multiple sets
-
of immune cells seems to be an issue here. But this kind of sends us chasing down this
-
idea that is now extremely common. And this is one of the great right, right? So, 10 years
-
ago GWAS wasn’t going to work. And five years ago, everyone was asking why we haven’t
-
solved disease yet. Five years ago, everything was coding. And now, everything is regulatory.
-
And it seems really obvious. But even two, three years ago, this was not that obvious.
-
And so, this chases us down -- starts us chasing down this rabbit hole of which genes are getting
-
dysregulated and how does that cause disease. And so, that’s what we are going to talk
-
about today -- further evidence that in specific immune cells, you get dysregulation that maps
-
into specific transcription factor binding sites. This is from Kyle Farh and Brad Bernstein
-
showing that the MS SNPs are particularly enriched for NF-kappa B transcription factor
-
ChIP-seq peaks for instance. And so, there’s something that’s fairly specific dysregulation
-
in immune cells, which is great in bulk, hard when you actually want to identify specific
-
effects on specific genes in specific cells. And so, that’s the task at hand. And so,
-
when you look at some of the loci, you know, you put up a GWAS locus. Here’s a classic
-
locus in MS. Well, there’s NF-kappa B1 and mannose-binding protein A. And you could
-
sort of make a case for mannose-binding protein A, but really everyone’s going to assume
-
that NF-kappa B1 is the appropriate gene. And it turns out that that’s right
-
for various reasons. And so you can start working on that because you kind of are reasonably
-
sure that’s the gene.
-
When you look at another locus of course, that gets a lot more difficult. You’ve got
-
this big association peak. There’s a bunch of genes in here, and the problem isn’t
-
that they’re not good candidates. There’s a bunch of good candidates in here. ORMDL3
-
is here. IKZF3, which is Helios, which is a transcription factor that controls T regulatory
-
cell differentiation is there. A bunch of other immune cells. And so, you’re kind
-
of going, “What’s going on here?” So, we kind of thought, “Okay. If there is regulation,
-
and we have SNPs, how do we unite the genetics with the epigenomics?” And a lot of people
-
are thinking about this. You’re going to hear a lot more stories about this. You’ve
-
already heard some. Here’s how we’ve been thinking about it.
-
So, we’re kind of amateur math geeks, and so we’ve been thinking about how we can transfer some
-
of this probability and do some functional fine mapping. So, you have a set of SNPs in
-
the genome. We’re going to talk about hypersensitive sites now. But instead of DHS, you can think
-
of any regulatory mark. We’ve been working a lot with hypersensitive sites because we
-
like them. They’re stable. They’re nice. They tell you a lot. We’re going to expand
-
this to the other sets. But think about DHS for now. And you’ve a gene in the locus.
-
So, this is my like tiered view of a locus.
-
So, each of these guys is associated to disease. And -- oh, this is going to chop off my -- thanks.
-
Oh well. So, what that says is posterior probability of association or PPA, okay? So, when you
-
do a GWAS for each of these SNPs, you get a P value of whether it’s associated to
-
disease or not. You can convert that simple P value into basically a posterior probability
-
which tells you, what is the likelihood that this SNP is the one driving the signal, okay?
-
We’re not going to talk about the math magic that underlies that. I’ll bore you with
-
it in person over a coffee if you like. But basically, for each of these SNPs, you can
-
do a magical transformation and get the probability that that’s the SNP that’s driving signal.
-
If it’s very associated, and nothing else is associated, it’s going to be really probable
-
to drive the signal. If there’s a whole bunch of SNPs that are equally associated,
-
you’re going to have to spread the probability that it’s causal all over all of those guys,
-
right? That’s the intuition here. So, of course, some of these SNPs are actually on
-
DHSs. And so, you can transfer that probability. I can’t even talk anymore, sorry. That probability
-
to the DHS. You could also do something fancy like say this guy is about this far away
-
from this DHS, so I’m going to give it some proportion here. That’s -- we’re not doing
-
that right now. But basically, what I can do is come up with a way to score every regulatory
-
region for its probability of explaining the association in that region, right?
-
And if I sum every one of those -- of course not every SNP is on those -- but if I sum
-
all of these posteriors, that gives me the global probability that, in this locus, association
-
is mediated by these regulatory regions. Doesn’t have to be all of it. But if most of the signal
-
is on DHSs, then you’re going to get a high percentage, right? It’s going to be close
-
to one. If it doesn’t look like it’s being mediated by regulatory regions, you’re going
-
to get a low proportion.
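
As a rough illustration of the bookkeeping described here, the sketch below turns per-SNP association statistics into posteriors and then sums the ones that land on DHSs. It is only a sketch, not the lab's method, which the talk does not spell out: it assumes one causal variant per locus, equal priors, non-overlapping DHS intervals, and it uses exp(z^2/2) as a stand-in for the Bayes factor. The function names, z-scores, and positions are all hypothetical.

```python
import math

def posterior_probabilities(z_scores):
    """Convert per-SNP z-scores into posteriors that sum to 1 within a locus.

    Assumption (not from the talk): one causal SNP per locus, flat prior,
    and relative evidence proportional to exp(z^2 / 2).
    """
    log_evidence = [z * z / 2.0 for z in z_scores]
    largest = max(log_evidence)  # subtract the max to avoid overflow
    weights = [math.exp(v - largest) for v in log_evidence]
    total = sum(weights)
    return [w / total for w in weights]

def dhs_posteriors(snp_positions, ppa, dhs_intervals):
    """Give each DHS the summed posterior of the SNPs that fall inside it."""
    return {
        name: sum(p for pos, p in zip(snp_positions, ppa) if start <= pos < end)
        for name, (start, end) in dhs_intervals.items()
    }

# Hypothetical locus: three equally associated SNPs share the posterior.
positions = [100, 250, 900]
ppa = posterior_probabilities([5.0, 5.0, 5.0])
per_dhs = dhs_posteriors(positions, ppa, {"dhs_a": (80, 300), "dhs_b": (2000, 2100)})
# Fraction of the locus signal sitting on DHSs (two of the three SNPs here).
locus_regulatory_fraction = sum(per_dhs.values())
```

With equally associated SNPs the posterior is split evenly, which is the "spread the probability" intuition; a single dominant SNP would take nearly all of it.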
-
That much is easy. What’s cool is you can think about how you correlate these guys
-
to the genes they control. So, if I had a magic way of saying, “Well, this DHS is
-
correlated to this gene this much,” then I can weight how much of the posterior of association
-
gets transferred into this gene, right? So, if this guy’s perfectly correlated -- if
-
this is what determines whether this gene is expressed -- then if this explains all
-
of the association to a trait, then presumably, it’s active on this gene. Because the DHS
-
isn’t just a DHS. It’s regulating something, right? So, that’s the intuition. And you
-
partition this all this way. And what it says here is CP times PPA, okay? So, that’s just
-
the correlation posterior between this DHS and this gene times how much weight you’ve
-
given it from the association data. And that way, you wind up building this model of this
-
gene posterior. So, if I sum all of these, all of the contribution of each DHS from the
-
SNPs going into this gene, I can get a sense of what the probability that this gene is
-
driving association in this region is. And I can do that for any gene.
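
The CP times PPA sum can be written down in a few lines. This is a minimal sketch under the assumption that each DHS already has a posterior of association and that each gene has a correlation posterior (CP) for the DHSs linked to it; the dictionary layout and all names and numbers are hypothetical.

```python
def gene_posteriors(dhs_ppa, correlation_posterior):
    """Score each gene as the sum over its linked DHSs of CP * PPA.

    dhs_ppa: {dhs_name: posterior of association carried by that DHS}
    correlation_posterior: {gene: {dhs_name: CP between that DHS and the gene}}
    """
    return {
        gene: sum(cp * dhs_ppa.get(dhs, 0.0) for dhs, cp in links.items())
        for gene, links in correlation_posterior.items()
    }

# Hypothetical locus with two DHSs and two candidate genes.
dhs_ppa = {"dhs_1": 0.6, "dhs_2": 0.3}
cp = {
    "gene_a": {"dhs_1": 1.0, "dhs_2": 0.5},  # tightly linked to both elements
    "gene_b": {"dhs_1": 0.1},                # weakly linked to one element
}
scores = gene_posteriors(dhs_ppa, cp)
```

In this toy locus gene_a picks up most of the signal because the DHS carrying most of the association is the one that best tracks its expression.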
-
So, I now derive a score basically for how likely this gene is to be pathogenic, if that
-
pathogenesis is mediated by DHS regions. And we know they’re enriched, so that’s a
-
reasonable hypothesis, okay? It’s not the only way to do it, but it’s one way to think
-
about this. And so, you have to solve a couple of technical problems to do this. One is,
-
you’ve got to correlate your DHSs to your genes. And so, that’s really simple. You
-
just observe if there’s a peak, and what the level of expression of a gene is, and
-
then you do a correlation, on-off versus level of expression of a gene. And you do that for
-
each DHS you find.
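
The on-off versus expression correlation is just a Pearson correlation where one variable is binary. A minimal sketch, assuming one 0/1 peak call and one expression value per sample (the function name and inputs are hypothetical, and the talk does not say which correlation statistic is actually used):

```python
import math

def on_off_correlation(peak_present, expression):
    """Pearson correlation between a 0/1 peak call and expression level."""
    n = len(peak_present)
    if n != len(expression) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_x = sum(peak_present) / n
    mean_y = sum(expression) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(peak_present, expression))
    var_x = sum((x - mean_x) ** 2 for x in peak_present)
    var_y = sum((y - mean_y) ** 2 for y in expression)
    if var_x == 0 or var_y == 0:
        return 0.0  # peak always on/off, or flat expression: nothing to correlate
    return cov / math.sqrt(var_x * var_y)

# Hypothetical DHS that is open exactly in the samples where the gene is high.
r = on_off_correlation([1, 1, 0, 0], [10.0, 10.0, 2.0, 2.0])
```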
-
Two issues. First of all, you’ve got to decide what the same DHS is. And secondly,
-
you need measurements where you’ve measured both DHS and gene expression, okay? So, to
-
do this thing, we use an alignment approach. This is what real DHS data looks like out
-
of Hotspot. These are peaks. This is an arbitrary part of the genome and your job is to figure out
-
which ones of these represent the same element across samples. We’re not terribly good
-
at that as human beings. Fortunately, computers are a lot better at this than we are.
-
So, you can put it in a clustering approach and kind of decide that these look the same
-
that are a little jittered, but they kind of look similar. And then these guys are kind
-
of the same, but you’re maybe a little less confident because there’s more spread.
-
And these guys are kind of the same as well, but there’s even more spread, okay? And
-
the way we do this is with Markov clustering. It’s a way to cluster stuff. There are other
-
ways to do it. It works reasonably well. And the way you think about this -- oh, and that’s
-
gotten chopped up as well. That’s brilliant. Okay. So, one way you might want to do this
-
is to say, is this detectable? And so, you go into the Roadmap data, and fortunately
-
there are replicates.
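
To make the peak-alignment problem concrete, here is a deliberately simple stand-in for the clustering step. It is not the clustering algorithm named in the talk: it just merges peak midpoints that sit within a fixed gap of each other. The `max_gap` value, the input layout, and the function name are all assumptions for illustration.

```python
def cluster_peaks(peaks, max_gap=150):
    """Group peak midpoints on one chromosome into candidate shared elements.

    peaks: list of (sample_name, midpoint) tuples.
    A new cluster starts whenever the next midpoint is more than max_gap
    bases beyond the last midpoint already in the current cluster.
    """
    clusters = []
    for sample, mid in sorted(peaks, key=lambda p: p[1]):
        if clusters and mid - clusters[-1]["end"] <= max_gap:
            clusters[-1]["end"] = mid
            clusters[-1]["samples"].add(sample)
        else:
            clusters.append({"start": mid, "end": mid, "samples": {sample}})
    return clusters

# Hypothetical peaks: three jittered calls of one element, then two loners.
peaks = [("s1", 1000), ("s2", 1040), ("s3", 1010), ("s1", 5000), ("s2", 5300)]
clusters = cluster_peaks(peaks)
spreads = [c["end"] - c["start"] for c in clusters]  # wider means less confident
```

The spread of each cluster is the quantity the talk treats with suspicion: the wider and sloppier it is, the less likely it is to be a single real element.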
-
And here’s my assertion. If I see a peak here in replicate one of a tissue, then I should
-
expect to see that peak in replicate two of a tissue as well, right? Biological replication,
-
just as we do in any other experiment. Really simple. And so once I decide this is my cluster,
-
that’s what comes out of the algorithm, you don’t just go and apply that mindlessly
-
to data. That’s not how you do analysis, right? You check and you see what you can
-
detect. And of course, the wider and the sloppier this peak is, the less likely it is to be
-
true. And so you can do a statistical test. And so, once you’ve decided what the cluster
-
is, if there’s a peak anywhere in that cluster, you mark that sample as a one. And if there’s
-
no peak, you mark it as a zero. If you have replicates, where the labels are somewhere over
-
here on that wall, you can then say, “Okay, if I see ones in both replicates,
-
I’m going to score that tissue as a two. I’m going to score it as a one if there’s
-
a peak in only one replicate,” so, if it’s discordant, “and I’m going to score it as
-
a zero if there’s none there.” And then you can do a test.
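
The two, one, zero scoring is simple enough to show directly. A minimal sketch, assuming exactly two replicates per tissue and a 0/1 peak call for each; the tissue names and the function name are hypothetical.

```python
def score_tissues(replicate_calls):
    """Map each tissue's pair of 0/1 peak calls to a 0/1/2 score.

    2 = peak in both replicates, 1 = discordant, 0 = peak in neither.
    """
    scores = {}
    for tissue, (rep1, rep2) in replicate_calls.items():
        if rep1 not in (0, 1) or rep2 not in (0, 1):
            raise ValueError(f"calls for {tissue} must be 0 or 1")
        scores[tissue] = rep1 + rep2
    return scores

# Hypothetical calls for one cluster across three tissues.
calls = {"tissue_a": (1, 1), "tissue_b": (1, 0), "tissue_c": (0, 0)}
tissue_scores = score_tissues(calls)
```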
-
So, I’ve done this without knowing about replicates. And then I add the information about
-
what goes with what and I ask, “Are they consistent?” So, if I get things like “Look,
-
in cell type one, I get a one. And in two, I get a one. I get all ones.” That suggests
-
this isn’t consistent. It’s not replicating. And if I get a lot of twos, a lot of zeros
-
and very few ones, that looks consistent. So, it’s replicating. It’s either not
-
there or it’s there. And so I can do a statistical test. It’s not terribly important what the
-
test is. It’s a simple chi-square approximation. We do this over 57 tissue replicates. So,
-
from Roadmap. And we find that just feeding this in when we cluster, we can get about
-
a million out of 1.99 million. So, about 54 percent of our clusters pass our fairly stringent
-
threshold -- a fairly lenient threshold. And that’s because very often these things are
-
kind of diffuse. The clusters don’t really look good. And so, we’re probably not doing
-
great at the clustering, and it’s unreliable, right? There’s also a bunch of singletons
-
in these data that get thrown out because they don’t replicate. But most of this is
-
actually the clustering.
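
The talk only says the consistency check is a simple chi-square approximation, so the following is one possible version rather than the test actually used. It assumes that if the two replicates were independent, the counts of zeros, ones, and twos across tissues would follow (1-p)^2, 2p(1-p), p^2, where p is the overall peak frequency, and it compares the observed counts to that expectation with one degree of freedom. All names are hypothetical.

```python
import math

def replicate_consistency(scores):
    """Chi-square check on 0/1/2 tissue scores for one cluster.

    Returns (statistic, p_value, concordant). `concordant` is True when
    discordant tissues (score 1) are rarer than independence would predict,
    i.e. the peak tends to be present in both replicates or in neither.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("no tissues to test")
    observed = [scores.count(k) for k in (0, 1, 2)]
    p = (observed[1] + 2 * observed[2]) / (2 * n)  # overall peak frequency
    if p == 0.0 or p == 1.0:
        return 0.0, 1.0, False  # peak never or always called: uninformative
    expected = [n * (1 - p) ** 2, 2 * n * p * (1 - p), n * p * p]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 df
    return stat, p_value, observed[1] < expected[1]

# Hypothetical clusters: one that replicates cleanly, one that never does.
clean = replicate_consistency([2] * 10 + [0] * 10)
messy = replicate_consistency([1] * 20)
```

Both toy clusters deviate strongly from independence, which is why the direction flag matters: only the first one deviates in the replicating direction.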
-
So, we can get about a million features about the genome. And we don’t worry about recovering
-
more stuff and improving the clustering. Right now, we’re just working with these million.
-
So, the other thing is, you’ve got relatively low power. And so, what’s nice about this
-
is this -- what you can clearly read here -- what you can do is estimate how much of the
-
heritability you’re still explaining. So, this is just a sanity check. If you use all
-
of these clusters, it’s about 14 percent of the genome, and it explains a proportion
-
of heritability. And what I want to know is if I reduce this to the half of the clusters
-
that I’m using now, what proportion of heritability am I still explaining? And to a first approximation,
-
what you can see here is in red is all the peaks and in blue is just the clusters that
-
we define. Pretty much we’re capturing all of the signal. There’s some wiggle room. There’s
-
a little bit of error on these things, but we’re capturing just about all of the heritability.
-
But we’ve gone from 14 percent of the genome to 8 percent of the genome.
-
So, rather than do the 500 base pair either side, which is what most of the previous heritability
-
estimates have done, which a lot of the summary papers have kind of shown, “Oh, there’s
-
enrichment in DHS or in regulatory regions or whatever.” But they actually bracket
-
each feature by 500 bases. And so, they cover 50 percent of the genome. So yes, all of the
-
heritability is explained by 50 percent of the genome. I’m telling you that a lot of
-
the heritability’s explained by eight percent of the genome. So, it’s a lot more specific.
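
The effect of bracketing each feature by 500 bases is easy to see on a toy example. This sketch merges overlapping intervals and reports the fraction of a genome they cover, with and without padding. The coordinates and genome length are made up, and the function name is hypothetical; it says nothing about the real 8, 14, or 50 percent figures.

```python
def covered_fraction(intervals, genome_length, pad=0):
    """Fraction of the genome covered by half-open intervals after padding."""
    padded = sorted(
        (max(0, start - pad), min(genome_length, end + pad))
        for start, end in intervals
    )
    covered = 0
    cur_start, cur_end = None, None
    for start, end in padded:
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start  # close the previous merged block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping: extend the block
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length

# Two hypothetical 150 bp features on a 10 kb toy genome.
features = [(1000, 1150), (1400, 1550)]
tight = covered_fraction(features, 10_000)            # features only
padded = covered_fraction(features, 10_000, pad=500)  # 500 bp either side
```

Even in this tiny case the padded footprint is about five times the unpadded one, which is the reason an enrichment claim over padded features is much less specific.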
-
And so, the second challenge is to now correlate these guys, now that we’ve decided what
-
clusters are, to correlate them to gene expression. So, you need matched data. We use 22 sets
-
of matched DHS and exon array data from Roadmap again. And the problem is, there’s massive
-
inflation because gene expression data of course is highly correlated. And so you just
-
get this massive inflation in the expected distribution of these tests. And we can correct
-
this. We just go through and normalize it and basically, you kind of start off with
-
this massive inflation. I’m showing you lambda here. It’s supposed to be a nice
-
straight line here. And we can correct all of that out.
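
For reference, lambda is conventionally the median of the observed test statistics divided by the median expected under the null. The talk does not describe how the normalization is done, so the rescaling below is a genomic-control style stand-in under the assumption that the tests are 1-df chi-square statistics; the function names and example values are hypothetical.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 df

def inflation_factor(chi2_stats):
    """Lambda: observed median statistic over the null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def rescale(chi2_stats):
    """Divide every statistic by lambda so the median matches the null."""
    lam = max(inflation_factor(chi2_stats), 1.0)  # leave deflated sets alone
    return [s / lam for s in chi2_stats]

# Hypothetical inflated statistics: the median is twice what the null predicts.
inflated = [0.2, 0.9098728, 3.0]
lam = inflation_factor(inflated)
corrected = rescale(inflated)
```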
-
So, now that we have all of these statistics, we can go back and do our little approach.
-
So, now we have this part. We already have this part from credible interval set mapping
-
and posterior estimation. And we can now estimate gene-wide scores. And so, big red exclamation
-
point here you can see means this is really fresh, as in last Friday’s results. Hot
-
off the presses. Here is a region. It -- we’re talking about MS GWAS. This is actually the