
  • [music playing]

  • >> Mary Engler: Well, welcome back from break and

  • I'm delighted to introduce -- after such an incredible

  • morning with such great speakers -- I'm delighted to

  • introduce our next speaker Dr. Bonnie Westra, who'll be

  • presenting Big Data Analytics for Healthcare.

  • Dr. Westra is director for the Center of Nursing

  • Informatics and associate professor in the

  • School of Nursing at the University of Minnesota.

  • She works to improve the exchange and use of

  • electronic health data.

  • Her important work aims to help older adults remain in

  • their community and live healthy lives.

  • Dr. Westra is committed to using nursing and health

  • data to support improved and better patient outcomes as

  • well as developing the next generation of nurse

  • informaticists -- informatistatcians.

  • [laughter]

  • Okay.

  • Please, join me in a warm welcome for Dr. Westra.

  • [applause]

  • >> Bonnie Westra: Is it potato or potato [laughs]?

  • [laughter]

  • So, I am just absolutely thrilled to be here and this

  • is an amazing audience.

  • It's grown since last year, so this is great.

  • So, today what I'd like to do is to relate the

  • importance of big data in healthcare to what we're

  • talking about today, identify some of the

  • critical steps to make data useful so when you think of

  • electronic health record data or secondary use of

  • existing data, there is a lot that has to be done to

  • make it usable for purposes of research.

  • Look at some of the principles of big data

  • analytics and then talk about some examples of some

  • of the science, and you'll hear a lot more about that

  • during the week in terms of more in depth on that.

  • So, when we think about big data science, it's really

  • the application of mathematical algorithms to

  • large data sets to infer probabilities

  • for prediction.

  • That's the very simple definition.

  • You'll hear a number of other definitions as you go

  • through the week as well.

  • And the purpose is really to find novel patterns in data

  • to enable data driven decisions.

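To make that definition concrete, here is a minimal sketch in Python of "algorithms applied to large data sets to infer probabilities for prediction." The cohort, variables, and values are invented for illustration only; they are not from the talk.

```python
# A hedged toy example: fit a model, then infer a probability for prediction.
# All data here is made up for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cohort: age, systolic BP, and whether a complication occurred.
X = np.array([[54, 118], [71, 145], [63, 132], [48, 110], [80, 160], [58, 125]])
y = np.array([0, 1, 1, 0, 1, 0])  # 1 = complication observed

model = LogisticRegression().fit(X, y)

# The output is a probability, not a yes/no answer.
new_patient = np.array([[66, 140]])
print(model.predict_proba(new_patient)[0, 1])  # P(complication)
```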
  • I think as we continue to progress with big data

  • science, we won't only find novel patterns but in fact

  • we'll be able to do much more in terms of

  • demonstrating hypotheses.

  • One of my students was at a big data conference that

  • the Mayo Clinic in Minnesota was putting on, and one of

  • the things that they're starting to do now is to

  • replicate clinical trials using big data, and they're

  • in some cases able to come up with results that are 95

  • percent similar to having done the clinical

  • trials themselves.

  • So we're going to be seeing a real shift in the use of

  • big data in the future.

  • So when I think about big data analytics, what this

  • picture's really portraying is big data analytics exists

  • on a continuum for clinical translational science from

  • T1 to T4 where there's foundational types of work

  • that need to be done but we actually need to apply the

  • results in clinical practice and to learn from clinical

  • practice that it then informs foundational

  • science again.

  • When you look at the middle of this picture, what this

  • is really showing is that this is really what nursing

  • is about.

  • If you look at the ANA's scope and standards of

  • practice on the social policy statements, nursing

  • is really about protecting, promoting health and then to

  • alleviate suffering.

  • So when we focus on -- when we think about big data

  • science in nursing, that's really kind of our area

  • of expertise.

  • And what you see on the bottom of this graph is it's

  • really about when we move from data, you know, we

  • don't lack data.

  • We lack information and knowledge and so it's really

  • about how we transform data into information into

  • knowledge, and then the wise use of that information

  • within practice itself.

  • This was, I -- we were doing a conference back in

  • Minnesota on big data and I happened to run into this

  • graphic that just, you know, it's like how fast is data

  • growing nowadays?

  • And so what you can see is data flows so fast that the

  • total accumulation in the past two years is a zettabyte.

  • And I'm like, "Well, what is a zettabyte?"

  • A zettabyte is a one with 21 zeroes after it.

  • And that what you can see is the amount of data that

  • we've accumulated in the last two years equals all

  • the total information in the last century.

  • So the rate of growth of data is getting to be huge.

  • Data by itself though, isn't sufficient.

  • It really needs to be able to be transferred or

  • transformed into information and knowledge.

  • Well, when we think about healthcare, what we can see

  • is that the definition is that it's a large volume,

  • but it might not be large volume.

  • So when you think about genomics sometimes it's not

  • a large volume, but it's very complex data, and that

  • as we think about getting beyond genomics and we think

  • about where we're at, it's really looking at where are

  • all the variety of data sources and, it's the

  • integration of multiple datasets that we're really

  • running into now.

  • And it's data that accumulates over time, so

  • it's ever changing and the speed of it is

  • ever changing.

  • What you can see in the right-hand corner here is

  • that there -- as we think about the new health

  • sciences and data sources, genomics is a really

  • critical piece, but the electronic health record,

  • patient portals, social media, the drug research

  • test results, all the monitoring and sensing

  • technology and more recently adding in geocoding.

  • So as we think about geocoding, it's really the

  • ability to pinpoint the latitude and longitude of

  • where patients exist.

  • It's a more precise way of looking at the geographical

  • setting in which patients exist, and that there's a

  • lot of secondary data then around geocodes that can

  • give us background information about

  • neighborhoods that include such things as, you know,

  • looking at financial class, education.

  • Now it doesn't mean that it always applies to me,

  • because I might be an odd person in a neighborhood,

  • but it gives us more background information that

  • we may not be able to get from other resources.

  • So, big data is really about volume, velocity, veracity

  • as Dr. Grady pointed out earlier today.

  • Now as we think about big data, 10 years ago when I

  • went to the University of Minnesota and my Dean,

  • Connie Delaney [phonetic sp] had talked about doing data

  • mining and I thought, "Oh, that sounds

  • really interesting."

  • Because I was in the software business before and

  • our whole goal was to collect data in a

  • standardized way that can be reused for purposes of

  • research and quality improvement.

  • I just didn't know what to do with it once I got it.

  • And so I've had the fortune to work with data miners.

  • We have a large computer science department that is

  • internationally known for its data mining, and a lot

  • of that work was funded primarily by the National

  • Science Foundation at that time because it was really

  • about methodologies.

  • Well now we're starting to see big data science being

  • funded much more mainstream in addition now, NIH, CTSA,

  • et cetera, are all working on how do we fund the

  • knowledge, the new methodologies that we need

  • in terms of big data science?

  • So, an example of some of the big data science that

  • really is funded already today is that if we look at

  • our CTSAs.

  • So, there are 61-plus CTSAs -- Clinical and Translational

  • Science Awards -- across the country, and the goal is to

  • be able to share methodologies, to have

  • clinical data repositories and clinical data

  • warehouses, and then to begin to start to say, "How

  • do we do some research that goes across these CTSAs?

  • How do we collaborate together?"

  • Or as we look at PCORnet.

  • PCORnet is another example.

  • So as we think about, there are 11 clinical data

  • research networks -- this may have increased by now --

  • as well as 18 patient powered research networks.

  • We happen to participate in one that has 10 different

  • academic and healthcare systems working together,

  • and it means that for our data warehouse we have to

  • have a common data model with common data standards

  • with common data queries in order to be able to look at

  • research such as we're looking at ALS, obesity, and

  • breast cancer.

  • And wouldn't it be nice if we could look at some of the

  • signs and symptoms that nurses are interested in, in

  • addition to looking at specific kinds of diseases?

  • When we look at some of the work that Optum Health as

  • well as other insurance companies, they're really

  • beginning to take a look at amassing large datasets.

  • So Optum Labs happens to have 140 million lives from

  • claims data, and they're adding in 40 million lives

  • from electronic health records, so that provides

  • really large data sets for us to be able to ask some

  • questions in ways that we haven't been able to do.

  • I'm excited about reuse of existing data, and so

  • hopefully some of that enthusiasm will rub off on

  • you today because it's really a great opportunity.

  • Now, in order to use large data sources, what that

  • means is that we need a common data model.

  • We need standardized coding of data and we need

  • standardized queries.

  • What I mean by that is that if we don't ask about the

  • same variables and we don't collect the data or code the

  • data in the same ways, it makes it hard for us to be

  • able to do comparisons then across software vendors or

  • health systems or academic institutions.

  • And with the PCORI grant for instance, we're actually

  • looking at how do we do common queries so that if

  • we've got the common models, we can write a query and

  • share the queries with others to be able to pull

  • data out from multiple health systems in a

  • similar way.

  • So I'm going to talk about what I mean by that a little

  • bit more and show you examples of how we have to

  • be thinking in nursing about this as well as thinking

  • interprofessionally.

  • So when you look at PCORnet, they started with a common

  • data model one, then they went to version two, and now

  • this is version three that's being worked on at this time.

  • So you can see in the top left hand corner we have

  • conditions which might be patient reported conditions

  • as well as healthcare provider conditions, but you

  • can also see that down in the left hand corner that

  • there are also diagnoses.

  • So diagnoses are ICD9 coding that goes with it.

  • ICD10 is now unfolding.

  • Notice when you think about your science, where is the

  • data that you want for your science, and is it

  • represented in this common data model?

  • I would suggest that there are many types of data in the

  • common data model that are important to all of us as we

  • think about where we're going whether it's

  • demographics or medications or, you know, what are the

  • kinds of diseases that people have?

  • And there's also something missing as we move forward.

  • So, before I get to what's missing one of the things

  • that I want to point out that's critical is that in

  • order for PCORI or NCATS or any of these other

  • organizations to be able to do queries across multiple

  • institutions they have to have data standards.

  • And so when we look at demographics for instance,

  • OMB is the standard that we use for demographics.

  • When we look at medications, it's RxNorm for medications.

  • Laboratory is coded with LOINC.

  • Procedures are coded with CPT/HCPCS or ICD9/ICD10 codes.

  • We also have diagnoses that have ICD9/ICD10 but in

  • addition, SNOMED CT codes or another type of standard.

  • And when we look at vital status we're looking at the

  • CDC standard for vital status and with vital signs

  • they're using LOINC.

  • So LOINC started with laboratory data.

  • It's expanded to include types of documents.

  • It also has expanded now to include a lot of

  • clinical assessments.

  • So you're going to find the MDS used in nursing homes,

  • OASIS that's used in homecare, you'll see things

  • like the Braden or the Morse Fall Scales, and we're

  • expanding more types of assessments that are

  • important to nurses in the LOINC coding.

  • It also, by the way, includes the nursing

  • management minimum dataset, which the announcement just

  • came out this week that we've just finished updating

  • variables and they've been coded in LOINC, so if you

  • wanted to look at the work of Linda Aiken, for

  • instance, you'd find standard codes that can be

  • used across multiple settings.

  • So, our vision of what we want to see in terms of

  • clinical data repositories that are critical for nurses

  • is when we look at clinical data, we need to expand that

  • to include the nursing management minimum dataset.

  • What that means is we need to look at nursing

  • diagnoses, nursing interventions, nursing

  • outcomes, acuity, and we also have to take a look at

  • a national identifier for nurses.

  • Which, by the way, every registered nurse can apply

  • for an NPI which is the National Provider Identifier

  • so that we could track nurses across settings, just

  • like we do any other -- you know, the physicians or the

  • advanced nurse practitioners, but it's

  • available for any RN to be able to apply.

  • So, when we extend what data's available, if we

  • added in what are the interventions that nurses do?

  • What are the additional kinds of assessments that

  • nurses do?

  • That data is really critical for us to be able to do big

  • data science.

  • What you can also see is that there's management data

  • -- often times we think of that as claims data -- but

  • when you think about management data it needs to

  • go beyond that when we start talking about

  • standardized units.

  • Like if I see a patient in an ICU does it matter and

  • how do we even name ICUs?

  • Or psychiatric units?

  • At Mayo we used to call it 3 Mary Brigh.

  • Well, how generalizable is that?

  • So there are ways to be able to generalize the naming of

  • units and that actually builds off of the

  • NDNQI database.

  • And then when we look at the workforce in nursing, Linda

  • Aiken's work I think is just stellar in terms of really

  • trying to understand, what are the things that we

  • understand about nurses because they affect

  • patients' outcomes, and they also affect our nursing

  • workforce outcomes as well.

  • So our clinical data repositories need to expand

  • to include additional data that's sensitive to nurses

  • and nursing practice, and it also needs to go across the

  • continuum of care.

  • Now, at the University of Minnesota, we have a CTSA

  • award, and our partner is Fairview Health Systems.

  • And so you can see here that as we built our clinical

  • data repository we have a variety of different kinds

  • of data about patients and about encounters that we

  • have available to reuse for purposes of research.

  • You can bet that the students that I have in the

  • doctoral program are all being trained to be big

  • data researchers.

  • It's like, "Stick with me kid, because this is the way

  • we're going."

  • So they use this but they also use, like, some of the

  • tumor registries or transplant registries as

  • other data sources as well.

  • And this data's available then for looking at cohort

  • discovery or recruitment, observational studies, and

  • predictive analytics.

  • Now, when you look at what's actually in there and we

  • characterize that data, we basically have over 2

  • million patients just in this one data repository,

  • and we have about 4 billion rows of unique data, so we

  • don't lack data.

  • What's important to take a look at is, what is the

  • biggest piece of the pie here?

  • It's flow sheet data.

  • And what is flow sheet data?

  • >> Female Speaker: [inaudible]

  • >> Bonnie Westra: Yeah, it's primarily nursing data, but

  • it's also interprofessional, so PT, OT, speech and language,

  • dietician, social workers; there's specialized data

  • collection for, like, radiation oncology and that

  • kind of stuff.

  • But a lot of it is nurse sensitive data.

  • So one of the things that we've been doing as part of

  • our CTSI or CTSA award, is we're looking at this what

  • we call extended clinical data, and developing a

  • process to standardize how we move from the raw data

  • and mapping the flow sheet data to clinical data models.

  • And that these clinical data models then will become

  • generalizable across institutions, while the actual

  • mapping to the flow sheet I.D.s will be unique to

  • each institution.

  • One of the reasons this is important is I was just

  • working on our pain clinical data model this last weekend

  • trying to get ready to move it into a tool we call i2b2,

  • and we had something like 364 unique I.D.s for the way

  • we collect pain data, and that those 364 unique I.D.s

  • actually represented something like 54 concepts.

  • Or represented actually I think 36 concepts, and when

  • you do pain rating on a scale of 0 to 10, we had 54

  • different flow sheet I.D.s that are pain rating of 0 to 10.

  • Why don't we have one?

  • So, what that means is that we have a concept in our

  • clinical data model called pain rating, specifically

  • 0 to 10.

  • We also have the FLACC and the Wong-Baker and you know,

  • every other pain rating scale possible in the system.

  • But it means that we have to identify a topic like pain.

  • We have to identify what are the concepts that are

  • associated with that.

  • Then we have to look at how we map our flow sheets to

  • those concepts.

  • We then present it to our group in an interactive

  • process for validation before we can actually move

  • that into making it useful for purposes of research

  • -- researchers.

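As a hedged sketch of what that mapping step looks like in practice -- the flow sheet I.D.s and concept names below are invented for illustration, not the actual University of Minnesota model:

```python
# Hypothetical sketch: many site-specific flow sheet IDs map to one
# standardized concept in a clinical data model. IDs here are invented.
FLOWSHEET_TO_CONCEPT = {
    "304221": "pain_rating_0_10",   # adult med-surg pain score
    "517903": "pain_rating_0_10",   # ICU pain score, same 0-10 scale
    "662114": "pain_rating_0_10",   # ED triage pain score
    "710442": "pain_rating_flacc",  # FLACC behavioral scale
}

def standardize_row(flowsheet_id, value):
    """Map one raw flow sheet row to a concept in the data model."""
    concept = FLOWSHEET_TO_CONCEPT.get(flowsheet_id)
    return {"concept": concept, "value": value,
            "mapped": concept is not None}

print(standardize_row("517903", "7"))
# {'concept': 'pain_rating_0_10', 'value': '7', 'mapped': True}
```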
  • So we now have a standardized process that

  • we've been able to develop, and now we're moving it into

  • trying to develop open source software, so that if

  • you wanted to come play with us and you wanted to say, "I

  • like the model you're using and I want to use it, and

  • let's see if we can do some comparative effectiveness

  • research," that it's something that can be shared

  • with others.

  • And that's part of the nature of the CTSA awards is

  • that we develop things that can be used across so

  • everybody doesn't have to do it independently.

  • So here's examples of some of the clinical data models

  • that we've been developing.

  • So behavioral health, we have somebody who's a

  • specialist in that area who's working on a couple

  • of models.

  • Most of them are physiological at this point,

  • and we started that way because of another project

  • we're working with.

  • But one of the things that we started with internal is

  • we said, "What are the quality metrics that we're

  • having to report out that are sensitive to nursing?"

  • So when you look at prevention of falls,

  • prevention of pain, CAUTI, VTE, and one other I can't

  • think of right now, but we really tried to take a look

  • at what are those things that are really sensitive to

  • nursing practice and then how do we build our data

  • models that can be used for quality improvement, but

  • also can be used then for purposes of research?

  • If we do certain things at a certain point in time, does

  • it really matter?

  • And then we've extended it to some other areas that

  • are, you know, based on what are the most frequent kinds

  • of measures that might be important to nurse

  • researchers to be able to work with.

  • Now, one of the things that the CTSAs do is many of them

  • use a tool called i2b2, and i2b2 can do many things, but

  • one of the first things it does is it provides you with

  • de-identified counts, of how many patients do you have

  • that meet certain criteria; so if you're going to submit

  • a grant, that you would be able to know whether you had

  • enough patients to actually potentially recruit.

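This is not i2b2 itself, just a minimal runnable sketch of the idea behind it: a query that returns only a de-identified count, never patient-level rows. The table layout loosely imitates an i2b2-style fact table, and the codes and rows are invented:

```python
# A hedged sketch of i2b2-style cohort counting: only an aggregate count
# comes back, not identifiable data. Schema and codes are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE observation_fact (patient_num INT, concept_cd TEXT)")
con.executemany(
    "INSERT INTO observation_fact VALUES (?, ?)",
    [(1, "ICD9:428.0"), (2, "ICD9:428.0"), (2, "LOINC:pain_0_10"),
     (3, "LOINC:pain_0_10")],
)

# "How many patients meet my criteria?" -- enough to size a grant cohort.
n = con.execute(
    "SELECT COUNT(DISTINCT patient_num) FROM observation_fact "
    "WHERE concept_cd = 'ICD9:428.0'"
).fetchone()[0]
print(n)  # 2
```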
  • One of the things that is missing out of it is almost

  • everything that's in flow sheets.

  • So, Judy Warren and colleagues proposed an

  • example of what would it look like in i2b2 if we

  • added in some of the kinds of measures that we're looking

  • at that are like review of systems or some of the

  • clinical quality measures.

  • So we're in the process of really looking at a whole

  • methodology of how to move that flow sheet data from

  • the data models in to i2b2 so that anybody could say,

  • "Oh, I'd like to study, you know, prevention of

  • pressure ulcers.

  • How many stage four pressure ulcers do we actually have

  • and, you know, what kind of treatments are they getting

  • and does it matter?"

  • And so that's an example of how this tool will be used.

  • Now, in order to make data useful it also has to be coded.

  • So remember the slide I showed you that showed we're

  • using RxNorm and we're using LOINC and we're using OMB

  • and we're using CDC codes?

  • Well, when we look at what code set should be used for

  • standardizing the data that we use that's not part of

  • those kinds of data, you'll see that the American Nurses

  • Association actually has recognized 12 terminologies

  • or datasets and they're done recognizing new ones.

  • Now it's just continuing to keep them up to date.

  • And so, the ANA just came out with a new position

  • statement, "Inclusion of recognized terminology

  • supporting the nursing practice within electronic

  • health records and other information solutions."

  • What that means is they say in that new paper that just

  • came out is that all healthcare setting should

  • use some type of a standardized terminology

  • within their electronic health records to represent

  • nursing data.

  • It makes it reusable then for purposes of quality

  • improvement and comparative effectiveness research.

  • However, when it is stored within clinical data

  • repositories or when we're looking at interoperability

  • across systems, then SNOMED CT is the standard that

  • would be used for nursing diagnosis.

  • So you might use the Omaha System or NANDA or CCC or

  • any of these, but it has to be mapped then to SNOMED CT

  • so that if I'm using the Omaha system and you're

  • using ICNP, that they actually can talk to each

  • other where they have comparable terms.

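A hedged sketch of that mapping idea: two local terminologies translated to one shared SNOMED CT concept so the systems can exchange comparable data. The SNOMED CT identifiers below are placeholders, not real codes, and the term pairings are invented:

```python
# Hypothetical sketch: local nursing terms (Omaha System, ICNP, etc.)
# mapped to a shared SNOMED CT concept. Codes here are placeholders.
LOCAL_TO_SNOMED = {
    ("omaha", "Pain"):           "SCTID-PLACEHOLDER-1",
    ("icnp",  "Acute pain"):     "SCTID-PLACEHOLDER-1",  # same concept
    ("omaha", "Sleep and rest"): "SCTID-PLACEHOLDER-2",
}

def to_interoperable(source, term):
    """Translate a source-terminology term to the shared SNOMED CT code."""
    return LOCAL_TO_SNOMED.get((source, term))

# Two sites using different terminologies still "talk" via the shared code.
assert to_interoperable("omaha", "Pain") == to_interoperable("icnp", "Acute pain")
```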
  • What the ANA has also recommended is that nursing

  • interventions, while there's many standardized

  • terminologies, actually use SNOMED CT for being able to

  • do information exchange and for building your data

  • warehouses if you're using different systems that you

  • want to do research with.

  • And that nursing outcomes would be coded with SNOMED

  • CT, sometimes maybe LOINC, and that assessments be coded

  • with LOINC, and I won't go into all the details

  • underneath that because it's more complicated than that.

  • Because sometimes the answers are LOINC and

  • sometimes they're SNOMED CT, depending.

  • So there's a lot that goes on behind the scenes, but

  • this is really important because if -- and this

  • actually comes off of the ONC recommendations for

  • interoperability for clinical quality measures --

  • that's how these standards actually came about so that

  • it's consistent with the federal policy when we're

  • doing this.

  • So, ANA, it's on their website.

  • The URL was so long that we had permission just to put

  • it on our website and give you a short URL.

  • So if you want to learn more about it the URL is listed

  • down here.

  • So, another effort that is going on is that in addition

  • to some of the foundational work that we're doing

  • through the CTSA, is that there is a whole group

  • that's headed by Susan Matney that is about how do

  • we build out an assessment framework in very specific

  • coding for the kinds of questions that we ask for

  • physiological measures?

  • So when we look at the LOINC assessment framework we

  • start with first physiological measures, and

  • then there's other things shown in orange called the

  • future domains that also have to look at what are the

  • assessment criteria that are documented in electronic

  • health records that need standardized code sets?

  • So there's a group that Susan Matney is heading up

  • that includes software vendors, different

  • healthcare systems, people with EHRs that aren't the

  • same EHRs, and they're pulling together a minimum

  • set of assessment questions and getting standardized

  • codes for those minimum set of assessment questions and

  • they were just submitted to LOINC I think the end of

  • June for final coding and distribution in the next

  • release of LOINC.

  • And this group is continuing on to build out additional

  • criteria for assessment, so that we have comparable

  • standards across different systems.

  • Now, I mentioned that the nursing management minimum

  • dataset -- this was actually developed back in about 1997

  • recognized by the American Nurses Association and has

  • been just updated for two out of the three areas.

  • So in the environment you can see the types of data

  • elements that are included -- and this is very

  • high-level data elements -- there's a lot of detail

  • underneath these.

  • And you can see nursing resources.

  • Now, when this was updated we harmonized it with every

  • standard we could possibly find.

  • A lot of it has been NDNQI, the National Database of

  • Nursing Quality Indicators, but it's also been harmonized with

  • every other standard we could find so that there

  • weren't different standards consistently for these types

  • of variables.

  • It also -- if you've followed the Future of

  • Nursing -- Future of Nursing work from the IOM report and

  • the Robert Wood Johnson Foundation, it matches the

  • workforce data that they're trying to collect through

  • the national board -- state boards of nursing.

  • So again, if you're collecting data for one

  • reason, in fact you can actually use it for multiple

  • reasons when you're using a standard across the country.

  • So, there is a reference here.

  • You can go to LOINC.org, and if you look under news

  • you'll see the release that came out this last week

  • about this, and then you'll also see that if you go to

  • the University of Minnesota website that the

  • implementation guide is available that gives you all

  • of the details that you never wanted to know but

  • need if you're actually going to standardize your data.

  • So, the point of all this is that when you think about

  • using big data and you want to do nursing research, it's

  • really critical that we think about all of our

  • multiple data sources whether it's electronic

  • health record or if you're thinking about the nursing

  • management minimum dataset for instance.

  • You're thinking about scheduling, you're thinking

  • about HR data, and that doesn't even begin to get

  • into all the device data and the personal data

  • contributed by patients.

  • So that's additional data, and think about what it's

  • going to take to standardize that in addition.

  • It won't be on my plate, but many of you might want to

  • actually do that because it's a really good way to

  • begin to move forward.

  • So the message that I wanted to leave you with on that is

  • there's lots of data.

  • When we think about nursing research that we are at the

  • very beginning of starting to say, "What data?"

  • And how do we standardize that data?

  • And how do we store and retrieve that data in ways

  • that we can do comparative effectiveness research with

  • that data or some of the big data science.

  • Just one example, I'm not going to cover today but

  • I'll talk a little bit tomorrow, is we're pulling

  • data out of electronic health records to try to say, how

  • do we really understand patients that are likely to

  • have sepsis, and then there's the sepsis bundle,

  • that if you do -- you know, if you do certain types of

  • evidence-based practice quickly and on time, you can

  • actually prevent complications.

  • Well, we're pulling out electronic health record

  • data, and guess what?

  • This is really interesting.

  • We got an NSF grant to do this and so we said, "Well,

  • we're going to look at evidence-based practice

  • guidelines for nurses and physicians, and guess what?

  • The evidence-based practice guidelines for nurses aren't

  • really being used.

  • And so we're having to figure out how you would

  • find the data.

  • Not because nurses aren't doing a good job -- just the

  • guideline types of software weren't used in the way

  • we thought.

  • So then we said, "Well, we'll look at, you know,

  • we'll look at certain data elements, and then we're

  • also going to look at physician guidelines and are

  • they being used?"

  • So, in order to know if you did something in a timely

  • manner, you have to know, when did somebody suspect

  • that sepsis began?

  • Do you know where that's located?

  • Maybe in a physician's note.

  • And so the best way to find out if patients are likely

  • to develop sepsis is nurses' vital signs and the flow

  • sheet data.

  • And so consistent documentation in those flow

  • sheet data becomes really critical.

  • And then if they're being followed and adjusted, you

  • have to understand things like fluid balance,

  • cognitive status, your laboratory data as well as

  • the vital sign data that's going on with that, and lots

  • of other stuff.

  • So this EHR data is critical in terms of being able to

  • really look at how do we prevent complications.

  • So I'm going to talk a little bit now moving into

  • more of the analytics.

  • So when we think about analytics there is a book,

  • it's free online.

  • This is not an advertisement for them, but it was one

  • that changed my life.

  • And so it's called, "The Fourth Paradigm of Science."

  • And it really talks about, how do we move into data

  • intensive scientific discovery?

  • And one of the things that I think is really interesting

  • is, how many of you have ever read a book -- a

  • fiction book -- it's called "The Time Keeper"?

  • It is really a fun book.

  • The thing that's fun about it is it talks about before

  • people knew time existed, they hadn't picked up the

  • observational pattern thousands of years ago that

  • basically said that, "Oh, there is this repetitious

  • thing called time."

  • It then goes on to talk about the consequences for

  • us of how we want more of it, you know?

  • And so it's not always a good thing to discover

  • things, but, you know, our first science was really

  • about observations and really trying to understand

  • what do we notice?

  • You know, what's the empirical data?

  • We then moved into thinking about a theoretical branch.

  • So what are our models?

  • How do we increase the generalizability of our science?

  • From there we've moved, in the last few decades, into a

  • computational branch, which is really how do we simulate

  • complex phenomena?

  • And now, we're moving into data exploration or

  • something that's called e-Science.

  • So we can hear the term big data, or big data science.

  • E-Science is another term that's used for that.

  • So when you look at that, what you can see is that we

  • have data that's being captured by all kinds

  • of instruments.

  • We have data that's processed by software and we

  • have information and knowledge that's stored

  • in computers.

  • And so, what we really have to do is how do we look at

  • analyzing data from these files and these databases in

  • coming up with new knowledge?

  • And it requires new ways of thinking, and it requires

  • new tools and new methods as we move forward.

  • So foundational to big data science is algorithms and

  • artificial intelligence.

  • So how do we take a look at if this then that, if this

  • then that?

  • So it requires structured data, you know, so that we

  • can develop these algorithms to be able to come

  • to conclusions.

  • Now machines are much faster at processing these

  • algorithms than the human mind is, and they can

  • process much more complex data.

  • So our big data science is really about the use of

  • algorithms that are able to process data in really

  • rapid ways.

  • Semi -- what we call -- semi-automated.

  • Not totally like you just throw it in there and it

  • does it and it gives you the answer.

  • There's a lot more to it than that.

  • So there's some principles about big data science that

  • are important, and one of those principles is let the

  • data speak.

  • So, what that means is, we often times will say -- as I

  • take a look at trying to understand CAUTI, which is one of

  • the subjects that one of my students is working on.

  • She's really trying to understand, we have these

  • guidelines for this catheter

  • associated urinary tract infection, how do we

  • prevent that?

  • So if we follow the guidelines, why aren't we

  • doing any better?

  • And what's missing is we probably don't have the

  • right data that we're looking at.

  • So she's actually combining some of the management data

  • along with the clinical data to try to say are there

  • certain units?

  • Are there certain types of staffing?

  • Is there -- you know, how does staff satisfaction?

  • You know, how does that all play into all of this?

  • What's the experience?

  • What's the education?

  • You know, what's the certification, the background?

  • And so, she is throwing in more types of data and then

  • trying to let the data speak in terms of, you know, does

  • this provide us any new insights that we can

  • think about?

  • Another thing is to repurpose existing data.

  • So once you have data, 80 percent of big data science

  • is the data preparation.

  • I think it's closer to 90, but it takes forever to kind

  • of get the data set up because it's not like you're

  • collecting new data with a standardized instrument

  • that, you know, has all these validity and

  • reliability, so there's a lot of data preparation and

  • transformation that needs to go on.

  • So once you've got that done and you understand the data

  • and the metadata, that is the context, the meaning,

  • the background of why do we collect this?

  • What does it actually mean?

  • You know, give me the context of this.

  • Then we can understand, how is it collected?

  • Why was it collected?

  • What are the strengths of it?

  • What are the limitations?

  • When I first started in this, I worked in

  • homecare software.

  • There wasn't anything I didn't know about OASIS.

  • Because I learned a ton by making every mistake,

  • working with everybody I could, and understanding

  • it thoroughly.

  • When I went to work with big health system data, I'm

  • like a novice all over again.

  • So once I get a good dataset set up believe me, I'm going

  • to be working with that forever.

  • And so you'll see some examples of that tomorrow on

  • a different talk.

  • So in big data science another thing that we have

  • to think about is that N equals all versus sampling.

  • So it's not necessarily about random sampling, it's

  • really about once you've got all the data, you know, how

  • does that affect your assumptions about what

  • you're doing in science?

  • And there's another principle called

  • correlations versus causality.

  • So, you know, randomized clinical trials are trying

  • to understand the why.

  • Why did this happen?

  • And what we're trying to understand and when we've

  • got big data is, you know, what's the frequency with

  • which certain things occur?

  • What's the sensitivity?

  • What's the specificity?

  • How do we understand the probabilities that go with it?

  • And so we're often times looking at correlations

  • versus trying to look at causation.

  • Big data's messy.

  • I've had a chance to work with our CTSI database where

  • they've done a lot of cleanup and standardization

  • and then I've worked with the raw data, same

  • software vendor.

  • I've certainly learned that once you have the data and

  • you clean it up, it really makes a difference.

  • And will it ever be perfect?

  • Absolutely not.

  • But we think our instruments are perfect, you know?

  • And they're actually not either.

  • So there is a certain probability that things

  • occur and you get a large enough dataset.

  • You know, it really makes a difference in how you work

  • with the data.

  • And then there's also a concept called data

  • storage location.

  • So, there are some people that think you should put

  • all the world's data into a central database and work

  • with it, and then there are others that do something

  • called federated data queries.

  • So federated data queries are where, like with our PCORI

  • grant, everybody has their own data.

  • It's modeled in the same way and so we can send our

  • queries to be able to do big data research without having

  • all the data in the same pot at the same time.

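A minimal sketch of that federated idea, assuming two hypothetical sites that share a common data model: the same query function runs locally at each site, and only aggregate counts travel, never the row-level data.

```python
# Hedged sketch of a federated query. Sites and records are invented;
# the point is that the query ships to the data, not the reverse.
SITE_A = [{"dx": "ALS"}, {"dx": "obesity"}, {"dx": "obesity"}]
SITE_B = [{"dx": "breast cancer"}, {"dx": "obesity"}]

def count_obesity(records):
    """The shared query: runs locally at each site against the common model."""
    return sum(1 for r in records if r["dx"] == "obesity")

# Only counts leave the sites; the patient-level data never moves.
total = count_obesity(SITE_A) + count_obesity(SITE_B)
print(total)  # 3
```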
  • Another thing that's really critical is big data is a

  • team sport.

  • I can't say that enough.

  • If you ask me all the mathematical foundation for

  • the kind of research we're doing, I'm not the one that

  • can tell you that.

  • I work with these computer science guys that have very

  • strong mathematical background, and I get

  • educated everyday I work with them.

  • And so we need to -- and I also know from example that

  • they really don't understand clinical.

  • And so, you know, when we had a variable gender they

  • were going to take male and do male/not male

  • female/not female.

  • And it's like, you only have two answers in the database,

  • so why do we need four answers [laughs], you know,

  • for this?

  • But that's just a simple thing but they don't

  • understand, like, you know, what's a CVP, for instance.

  • I have to actually look some of that up now too as I'm

  • getting further away from clinical but it's really

  • trying to understand that you need a domain specialist.

  • You need a data scientist.

  • A data scientist is an expert in databases, machine

  • learning, statistics, and visualization.

  • And you need an informatician.

  • So how do you standardize and translate the data to

  • information and knowledge?

  • So, you know, understanding all that database stuff and

  • the terminology stuff is really important.

  • As I said, 80 percent is preprocessing of the data.

  • And then there's a whole thing called dimension

  • reduction and transform use of data.

  • So, one of my students said, "Well, I want to use ICD9

  • codes so I'll ask for those."

  • And I'm like, "What are you going to do with them?"

  • And so she finally got down to what I really need to

  • understand is there are certain diseases that

  • predispose people to having CAUTI.

  • And so, I only need to be able to aggregate them at a

  • very high level to see -- and so it means you have to

  • know all your ICD9 structure and be able to go up to

  • immunosuppressive drugs for instance or other diseases

  • that predispose you to getting infections or

  • previous history of infections.

  • So, you don't want 13,000 ICD9 codes.

  • You really want high-level categories.

  • So it's learning how to use the data, how to transform

  • the data.

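A hedged sketch of that kind of dimension reduction: detailed ICD9 codes rolled up to a few high-level categories relevant to the question. The prefix-to-category mapping below is illustrative, not a validated grouping.

```python
# Collapse thousands of ICD9 codes into a handful of high-level
# categories (here, hypothetical CAUTI risk factors).
HIGH_LEVEL = {
    "250": "diabetes",            # 250.xx diabetes mellitus
    "042": "immunocompromised",   # HIV
    "599": "prior_uti",           # 599.0 urinary tract infection
}

def rollup(icd9_code):
    """Reduce a detailed ICD9 code to a high-level risk category."""
    return HIGH_LEVEL.get(icd9_code.split(".")[0], "other")

print(rollup("250.02"))  # diabetes
print(rollup("599.0"))   # prior_uti
print(rollup("410.1"))   # other
```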
  • A lot of times we have many questions that represent the

  • same thing, so do you create a scale?

  • If your assumption for your data model is that you need

  • binary data, how do you do your data cuts?

  • You know?

  • So with OASIS data we use no problem or little problem

  • and moderate to severe problem because we need a

  • binary variable.

  • And so it's that kind of stuff that you need to do.

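As a sketch of such a data cut, assuming an ordinal 0-4 severity item like the OASIS example just described; the threshold is the analyst's choice and should be documented with the metadata:

```python
# Hedged sketch: collapse an ordinal severity item into the binary
# variable a data model assumes. Threshold is illustrative.
def to_binary(severity):
    """0-1 (no/little problem) -> 0; 2+ (moderate to severe) -> 1."""
    return 0 if severity <= 1 else 1

print([to_binary(s) for s in [0, 1, 2, 3, 4]])  # [0, 0, 1, 1, 1]
```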
  • And then there's all kinds of ways of saying, how do

  • you understand the strength of your answers?

  • You can quantify uncertainties so you're

  • looking at things like accuracy, precision, recall, trying to

  • understand sensitivity, specificity, using AUCs to

  • try and understand the strength of your models.

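A minimal sketch of quantifying those uncertainties with scikit-learn, run on made-up predictions rather than any real model output:

```python
# Hedged sketch: standard metrics for judging the strength of a model.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score)

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]   # what actually happened (made up)
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]   # the model's yes/no calls
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]  # predicted probabilities

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))   # recall = sensitivity
print("AUC      :", roc_auc_score(y_true, y_score))
```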
  • So I'm going to quickly go through just a few examples

  • of how we're now moving into using some of these types of

  • analysis and some of the newer methods of being able

  • to analyze data.

  • So, one is natural language processing.

  • Another is visualization and a third is data mining.

  • What I'm not going to do is address genomics.

  • I wouldn't touch that one, it's not my forte.

  • So, natural language processing -- another

  • name for it is text mining.

  • And that is, as we take a look at this, five percent

  • of our data is really structured data and most of it

  • is not structured data.

  • So we really need to -- we really need to think about

  • how do we deal with that unstructured data because it

  • has a lot of value within it.

  • So NLP can actually help us be able to create

  • structured data from unstructured data so that we

  • then can be able to use that data more effectively.

  • So, it really uses computer based linguistics and

  • artificial intelligence to be able to identify and

  • extract information and so free text data's really

  • the source.

  • So when you think of nurses' notes for instance.

  • The goal is to create useful data across the various

  • sites and to be able to get structured data for

  • knowledge discovery.

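A deliberately tiny sketch of that idea -- real NLP pipelines use computational linguistics and machine learning, not a single regular expression, and the note text here is invented:

```python
# Hedged toy example: pull structured fields out of free text.
import re

note = "Pt reports pain 7/10 in left hip; denies fever."

match = re.search(r"pain\s+(\d{1,2})/10", note, flags=re.IGNORECASE)
structured = {
    "pain_rating_0_10": int(match.group(1)) if match else None,
    "fever": "denies fever" not in note.lower(),
}
print(structured)  # {'pain_rating_0_10': 7, 'fever': False}
```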
  • And there are very specific criteria

  • for trustworthiness.

  • When I did my doctoral program and we wanted to do

  • qualitative research -- that was many years ago people

  • were a lot like, well that sounds like foo foo.

  • [laughter]

  • Well, now there is like, you know, really trustworthy

  • criteria and there's trustworthy criteria for

  • data mining as well.

  • So when you look at, how many of you have heard

  • of Watson?

  • Yeah, so when you think about Watson, Watson was

  • initially tested with Jeopardy, you know?

  • And finally it beat human beings.

  • So now IBM is actually moving into how can we use

  • that for purposes of healthcare?

  • And how do we begin to harness the algorithmic

  • potential of Watson?

  • So, Watson is really an opportunity to begin to

  • think about big data science and do you know how they're

  • training it?

  • They're asking -- they're doing almost kind of like a

  • think-aloud with physicians.

  • Like how do you make decisions?

  • You know, they're reviewing the literature to see what's

  • in the literature.

  • We need some nurses feeding data into Watson so that we

  • can get other kinds of data in addition.

  • But Watson uses natural language processing to then

  • create structured data to do the algorithms.

  • So when you think about another example, how many

  • heard of Google Flu Trends?

  • Yeah, so with Google Flu Trends, one of the things is

  • how do you mine data on the Internet?

  • What kinds of things are people actually searching

  • for that are things that are about flu?

  • What are the symptoms of flu?

  • What are the medications you take for managing the

  • symptoms of flu?

  • And what they found is that actually Google flu trends

  • could predict a flu epidemic before the CDC could.

  • Because it was based on patients searching on their

  • symptoms, and then based on that, they could see that

  • there was this trend emerging.

  • Now when they actually looked at who had flu, the

  • reported flu and the Google trends, CDC outdid Google,

  • but it pointed to an emerging trend that

  • was occurring.

  • And actually what we're seeing now is we're doing

  • some of that kind of mining of data with pharmaceutical

  • reports looking for adverse events.

  • And so the FDA has an adverse event

  • reporting system, and what they're finding is that as

  • they're looking at the combination of different

  • drugs that people are taking they're beginning to see

  • where adverse events are occurring through

  • combinations of different drugs that previously

  • weren't known.

  • So when you think about we do these clinical trials, we

  • get our drugs out on the market.

  • After the drug's out on the market it's like, how do

  • they actually work in the real population?

  • And I think Eric's presentation earlier with

  • that new graphic that just came out of Nature, that one

  • out of 10 or one out of 15 people actually benefits,

  • the question is how many people get harmed?

  • And how do we know what the combination of drugs is that

  • could actually cause harm?

  • So there's some really interesting stuff that's

  • going on with mining data and looking at combinations

  • to try to understand, are there things we just

  • don't know?

  • So another area's looking at novel associative diagnoses.

  • When I first read this I'm like, "I don't get it."

  • And what it is, is that we're really trying to

  • understand what kinds of meaningful diseases co-occur

  • together that we previously didn't know?

  • So an example is obesity and hypertension.

  • That's a real common one.

  • We know that those two go together frequently.

  • But how many combinations of diseases that we just don't

  • understand go together?

  • So there's a team of researchers that compared

  • literature mining with clinical data mining and

  • what they did is with this massive dataset they looked

  • at all the ICD9 codes in a massive dataset.

  • So this person has these three or five or 14

  • diagnoses that all co-occur together and they said,

  • "What do we see in the literature of what diagnoses

  • co-occur together?"

  • Because they thought that they could validate commonly

  • known ones which they could and they could discover new

  • ones that needed further investigation.

  • Well, when they looked at that, they

  • found there's very little overlap between diagnoses in

  • the clinical dataset and in the literature.

  • So the question is, is it that the methodology needs

  • to be improved?

  • Is it that we only know the tip of the iceberg of what

  • kind of things co-occur together?

  • Can we gain new insights about new combinations that

  • frequently co-occur together that can help us predict

  • problems that people have and try to get ahead of it?

  • Another example is early detection of heart failure.

  • So there was a study that was done and I won't

  • pronounce the name on this by this person and the team

  • and what they were really trying to do is can they

  • determine whether automated analytics of encounter

  • notes in the electronic health record might enable

  • the differentiation of subjects who would

  • ultimately be diagnosed with heart failure.

  • So if you look at signs and symptoms that people are

  • getting, can you begin to start seeing early on that

  • this person's going to be moving into heart failure or

  • that their heart failure might actually be worsening?

  • So that you can anticipate and try to prevent problems

  • so that you can anticipate and try to make sure that

  • the right treatment is being done?

  • So they wanted to use -- they used novel tools for

  • text mining notes for early symptoms and then they

  • compared it with patients who did and did not get

  • heart failure.

  • The good news is, is they found that they could detect

  • heart failure early.

  • The bad news is people who didn't get heart failure

  • also had some of those symptoms.

  • So again, we're at the beginning of this kind of

  • science and it really needs to be refined so that we can

  • begin to get better specificity and sensitivity

  • as we do these algorithms that we're developing

  • for predicting.

  • Now visualization is another type of tool and, so as you

  • think about how do we understand massive amounts

  • of information?

  • So there's a lot of different tools for helping

  • us to be able to quickly be able to see what is going

  • on, and so these are just examples of visualization

  • not to read the details on this.

  • But what you can see is there was a study done by

  • Lee [phonetic sp] and colleagues where they were

  • trying to understand older adults and their patterns of

  • wellness from point A to eight weeks later in terms

  • of their wellness patterns.

  • But what they were really trying to do in this study

  • is to say, what kind of way can you visualize

  • holistic health?

  • And do you visualize holistic health and the

  • change in holistic health over these eight weeks by

  • using a stacked bar graph, you know, or one of the

  • other types of devices?

  • And then they had focus groups and they tried to

  • say, "What do you think about this?"

  • You know, "How well does that help you to process

  • the information?"

  • And so it helped them to be able to think about it --

  • it's really a cognitive science kind of background

  • of how people process information, what kind of

  • colors, how much contrast, what shapes and design help

  • people be able to process information?

  • So this is kind of an emerging area where we're

  • really trying to understand patterns related to

  • different phenomena.

  • Karen Monsen for instance, one of my colleagues, has

  • been looking at this with public health data, and

  • she's looking at what are the patterns of care for

  • maternal child health patients?

  • Moms who have a lot of support needs from public

  • health nurses, and are there individual signatures of

  • nurses and how they provide care and are certain

  • patterns more effective, and with what subgroup of

  • patients are those patterns more effective?

  • So she's using visualization more like this stream

  • graphic over on the top left side here to look at

  • signatures of nursing practice over time.

  • So one of the things I find is that as we're doing data

  • mining, the genetic algorithms are increasing in

  • their accuracy and their abilities.

  • So if you think about the financial market, I don't

  • know about you, but I came back from a trip to Taiwan

  • one time, went to purchase something at RadioShack and

  • my credit card was declined.

  • And I'm like, "What do you mean my credit

  • card's declined?"

  • And they said, "It's declined."

  • And so I'd used it in Taiwan.

  • What I didn't know is that was an unusual pattern for

  • me and they happened to pick it up and they said, "Were

  • you in Taiwan?"

  • And I'm like, "Yeah, I was in Taiwan."

  • They said, "Okay, fine.

  • We'll enable your card again."

  • Well, it used to be that they would do a 25 percent

  • sample of all the transactions and be able to

  • pick up these abnormal patterns to try to look

  • for fraud.

  • Now they actually can process 100 percent of

  • transactions with fairly good accuracy.

  • So if they can do that with bank transactions, why can't

  • we do that with EHR data?

  • And part of it is they have nice, structured data

  • [laughs], you know?

  • Compared to what we're using.

  • So data mining is really about, how do you look at a

  • data repository, select out the type of data you want,

  • look at preprocessing that data, which is 80 percent of

  • the work, do transformation -- so creating scales or

  • looking at levels of granularity.

  • But then it uses some different kind of algorithms

  • and different analytic methods.

  • So up until I got to data mining on this graphic we're

  • really talking about traditional research in

  • many ways.

  • But when we get to data mining we're then looking at

  • all kinds of different algorithms that get run that

  • are semi-automated that can do a lot of the processing that we

  • have to do manually in traditional

  • statistical analysis.

  • And, in order to come up with results, the next step

  • is critical.

  • We can come up with lots of really weird results.

  • I can't remember the one that Eric showed earlier, or

  • maybe Patricia Grady did when she said, you know,

  • "Diapers and candy bars."

  • Or something like that.

  • But whatever it was, it doesn't make sense, and so

  • we really have to make sure that we're using our domain

  • knowledge in order to see, is this actually clinically

  • interpretable as we move forward?

  • So, data mining is also known as knowledge discovery

  • in databases.

  • It's automated or semi-automated processing of

  • data using very strong mathematical formulas to do

  • this and that there are absolutely ways of being

  • able to look at the trustworthiness of the data.

  • So we use -- a lot of it is sensitivity, specificity,

  • recall, accuracy, precision.

  • There's also something called false discovery rates

  • which is another way of checking out the validity of what

  • you're finding.

  • And there are lots of different methods, so some

  • of those methods are association rule learning,

  • there's clustering analysis, there's classification like

  • decision trees, and many new methods that are

  • emerging constantly.

  • So it's not like you can say data mining is just data mining.

  • It's like saying quantitative analysis, you know?

  • So it's lots of different methods of being able

  • to do this.

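As a hedged sketch of one of those methods, association rule learning, here are support and confidence computed by hand for a rule like "obesity implies hypertension" over toy patient records (the records are invented for illustration):

```python
# Hedged sketch of association rule learning on made-up records.
patients = [
    {"obesity", "hypertension"},
    {"obesity", "hypertension", "diabetes"},
    {"obesity"},
    {"hypertension"},
    {"diabetes"},
]

def support(itemset):
    """Fraction of patients whose record contains the whole itemset."""
    return sum(1 for p in patients if itemset <= p) / len(patients)

def confidence(lhs, rhs):
    """How often the right-hand side holds when the left-hand side does."""
    return support(lhs | rhs) / support(lhs)

print(support({"obesity", "hypertension"}))       # 0.4
print(confidence({"obesity"}, {"hypertension"}))  # 2/3, about 0.67
```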
  • I think an example of data mining is the fusion of big

  • data and little babies.

  • So there was actually a study that was done looking

  • at all the sensor data in a NICU and trying to

  • understand who's likely to develop infections, and by

  • capturing continuous data from multiple machines,

  • they were able to pick up, 24 hours earlier than the

  • usual methods, who was going to run into trouble

  • and to head it off with the NICU babies.

  • So, it has very practical applications.

  • Another example is looking at type 2 diabetes risk

  • assessment and really trying to understand with not just

  • association rules, but now we're moving into newer

  • methods of trying to look at time series along with

  • association rules and trying to see patterns over time

  • and how those patterns over time and the rules you can

  • create from the data will predict who's likely to run

  • into problems.

  • And so, some of the work that George Simon [phonetic sp]

  • has done with his group has really looked at survival

  • association rules and they substantially outperform the

  • Framingham score in terms of being able to look at the

  • development of complications.

  • So, in conclusion, big data are readily available.

  • We don't lack data.

  • The information infrastructure is critical

  • for big data analytics.

  • One of my colleagues I've done research with, she

  • said, "I just keep hoping one of the days you can just

  • throw it all in the pot and something will happen."

  • [laughter]

  • And it's like, that is not what big data analysis is about.

  • There are rules just like there are for qualitative

  • research or quantitative research.

  • And that the analytic methods are now

  • becoming mainstream.

  • So 10 years ago it would be really hard to get data

  • mining studies funded unless you went to the NSF.

  • Now that's getting to be more and more mainstream.

  • As a matter of fact, if you look in nursing journals and

  • you look for nurses who are doing data mining, you won't

  • find a lot out there yet.

  • So it's still just really at the beginning, but at least

  • we're starting to get some funding available now for

  • doing it.

  • So, one of the implications though out of this that we

  • really need to be thinking about is how are we training

  • our students, the emerging scientists.

  • How are we training ourselves here today?

  • But how are we training the emerging scientist to really

  • be prepared to do this kind of science of big data

  • analysis, and the newer methods that need to be done?

  • How do we think about integrating nurses into

  • existing interprofessional research teams?

  • So, I don't know about you, but how many nurses do you

  • know that are on CTSAs that are doing the data mining

  • with nursing data as part of the data warehouse?

  • Or on PCORI grants where they're building out, you

  • know, some of the signs and symptoms that nurses are

  • interested in, or the interventions, in addition to

  • the interprofessional data.

  • And so, it's really important that we take a

  • look at making sure that we're including nurse

  • sensitive data as part of interprofessional data and

  • that means that we really need to be paying attention

  • to the data standards, you know?

  • So that we are collecting consistent data in

  • consistent ways with consistent coding so we can

  • do the consistent queries to be able to really play in

  • the big data science arena.

  • So with that, I'll stop and see if you have

  • any questions.

  • I think we have one minute [laughs].

  • [applause]

  • We have a question over here.

  • Okay, so the question is how do you find the colleagues

  • like in computer science who can really help you?

  • Well, I tell you, I was really ignorant when I started.

  • I actually worked with somebody from the University

  • of Pennsylvania the first time I did it because I

  • didn't know any data miners at the University of Minnesota.

  • And, then I got talking with colleagues who said, "Oh, do

  • you know so and so who knows so and so?"

  • And then I started actually paying attention to what's

  • being published at the University of Minnesota.

  • It turns out that Vipin Kumar, who's head of the

  • computer science department, is actually one of the best

  • internationally known computer scientists.

  • Actually, he and Michael Steinbach, one of my

  • research partners, have their own book published on

  • data mining for the class that my students take with

  • -- along with the computer science students.

  • So, one, start with looking at -- if you look at some of

  • the publications coming out of your university, it's the

  • first place to start to figure out if you have

  • anybody around who can do data mining.

  • And I just didn't even know to think about that when I

  • first started.

  • So, it's a good way to start.

  • Part of it is paying attention to -- there's a

  • number of -- if you go to AMIA for instance there's a

  • whole strong track of data miners that have their own

  • working group at AMIA.

  • Also, there's a lot of data mining conferences going on

  • and so if you just start searching for -- I mean,

  • personally I do, I would do data mining and University

  • of Minnesota in Google, and that's a really fast way of

  • finding out who's doing that as another strategy to try

  • to find partners.

  • And they were thrilled to death, believe me, to get

  • hooked up with people in healthcare because they knew

  • that was an emerging area, big data.

  • They just knew that they didn't know it, and I didn't

  • know what they knew so together it made a

  • good partnership.

  • Okay, thank you.

  • [applause]

  • >> Mary Engler: Thank you, Dr. Westra, that was

  • just wonderful.

  • [music playing]
