MDI Data Blending Event

okay hello my name is Lisa Singh I’m a
professor in the computer science department and a research professor at
the Massive Data Institute I’d like to start by welcoming all of you to
Georgetown and to this beautiful Bioethics Research Library it’s one of
the nice venues here at Georgetown today’s event Data Blending: Tackling
the Obstacles is sponsored by the Massive Data Institute in the School of
Public Policy so this event is part of a larger scale effort at MDI to think
about innovative ways to answer research questions by combining or blending both
traditional and new forms of data more organic data where the original intent
of the data and the data generation process differ from the questions that
the data are being used to help answer so while the increase in
the volume of data available might seem like a panacea for researchers these new
forms of administrative data as well as social media data have generated
challenges around data quality representativeness and validity concerns
regarding consent and issues around the lack of universal data standards and
common data taxonomies today’s discussion will give us an opportunity
to get a range of perspectives on these issues from academia corporate and
government officials and researchers so we’ve designed the event to be in two
distinct parts the first part is a panel discussion where scholars will share
case studies and possible solutions for different challenges here we’ll see more
of a focus on organic data sets that are generated from social media sites the
second part of the event will be a conversation that highlights different
data blending projects that are being undertaken in government agencies here
the focus will be more on other types of organic and administrative data our hope
is that the ideas and directions that are shared today can help us
develop a set of best practices for undertaking different types of large and
new data blending projects so let’s get started
let me introduce Jonathan Ladd, he’s an associate professor in the School of
Public Policy and in the government department at Georgetown University he’s
also a non-resident senior fellow at the Brookings Institution and serves on
the executive council of the elections, public opinion, and voting
behavior section of the American Political Science Association and he’s
written extensively on public opinion, polarization, and media distrust
including a book entitled Why Americans Hate the Media and How It Matters
Dr. Ladd will be the moderator of the data blending research panel so
come on up and maybe we can also have all the panelists come up at the same
time thank you

um thank you for a very nice and overly generous introduction
I appreciate that we have a really great panel to start off the program this
afternoon I’ll start off by introducing our wonderful
panelists here and I’ll introduce them in the order they’re going to speak the
procedure we’re going to follow is they’re each gonna speak for about five to seven minutes and then
I’ll ask a few questions to start discussion and then
we’ll have questions from the audience so our first speaker is going to be
Quynh Nguyen who is an assistant professor of epidemiology and biostatistics in
the School of Public Health at the University of Maryland College Park her
PhD is in epidemiology from the UNC Gillings School of Global Public Health
and her specialization is in social epidemiology very happy to have her here
and after that we’re gonna have Josh Pasek speak he is an
associate professor of communication studies and a faculty associate at the
Center for Political Studies at the University of Michigan part of our
wonderful group of Michigan guests here today and he’s also core faculty
for the Michigan Institute for Data Science at the University of Michigan
and his PhD is in communication from Stanford after that we’re gonna have
Zeina Mneimneh who is going to speak to us she is also one of our Michigan
guests she is an assistant research scientist at the Survey Research Center
at the University of Michigan she has her PhD in Survey methodology from the
University of Michigan and is involved in a ton of other projects including she’s
director of the World Mental Health Data Collection Coordination Center and
she’s the chair of the executive committee for the International
Comparative Survey Design Initiative and finally we’re going to finish it off
with Ramanathan Guha he told me just to call him Guha so I’m just going to do
that he is the creator of widely used web standards such as RSS and
RDF and he’s also responsible for products such as Google Custom
Search he was also a co-founder of Epinions and Alpiri and until
recently he was a Google Fellow and a vice president of research
at Google he has his PhD in computer science from Stanford so he’s gonna be
our fourth speaker very excited for all the speakers I will now hand it over
to Quynh who will talk first

here we go and let’s see if I can project to
the mic okay thank you so much for the invitation I’m very happy to be here so
I’m gonna be trying to cover a lot so we’ve been working on this for about
four into our fifth year as the social technologists the motivation for doing
this work is to try to get contextual variables that go beyond the sociodemographic
characteristics available from the Census in trying to predict health
outcomes so what we’re interested in is what are some area-level characteristics
that are related to health outcomes and I’ll show you some of the
characteristics that we derive from Twitter so one project this is the newest
one actually is trying to look at sentiment and racial attitudes using
Twitter data and to get a data set of potentially race-related
tweets we built a list of about 500
race terms some of them just the standard race categories used by
government agencies and others racial terms so from that we
were able to collect about two million geotagged tweets and then an additional 31
million that are not geotagged but have place
characteristics so these tweets come with a bounding box and a text
description but the bounding box we found is basically the size of a state
so for those place characteristics the smallest area is
the state level and we’ll show you how we link it to health so on these tweets
we implement sentiment analysis and we
use training data from other groups as well as training data from our
own group and we’re finding quite satisfactory levels of accuracy around
90 percent with an F1 score around 84 so we’re quite happy with the sentiment analysis
a caveat is that comparing geotagged versus non-geotagged tweets we see very
different patterns so I’m sorry if it looks kind of small
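
An editorial aside on the accuracy and F1 figures mentioned above: the two numbers can diverge when the class of interest is rare, which is one reason both are usually reported. The sketch below is not the speaker's pipeline. The labels, predictions, and the function name `accuracy_and_f1` are hypothetical, chosen only to show how each metric is computed for a "negative" sentiment class.

```python
def accuracy_and_f1(y_true, y_pred, positive="negative"):
    """Return (accuracy, F1) where F1 is computed for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


# Hypothetical hand-labeled tweets versus classifier output (toy data).
labels = ["negative", "other", "other", "negative", "other",
          "other", "other", "negative", "other", "other"]
predictions = ["negative", "other", "other", "other", "other",
               "other", "negative", "negative", "other", "other"]

acc, f1 = accuracy_and_f1(labels, predictions)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

In this toy data the classifier gets 8 of 10 tweets right, but it misses one negative tweet and mislabels one neutral tweet, so F1 for the negative class comes out lower than accuracy.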
this is by state so if we’re looking at prevalence of negative race-related
tweets for a lot of states it looks like
it’s increasing but if you restrict it to geotagged tweets a lot of the states
look like they have decreasing prevalence of negative tweets so not
the same patterns at all and geotagged tweets make up a very small proportion of tweets
because geotagging is an opt-in feature so on your phone you have to turn on location
and also enable certain settings in your Twitter profile so it’s a
multi-step process and it’s opted in rather than on by default so we
have about 5% of people or less geotagging their tweets and hence
representativeness is going to be a concern and selection bias definitely can
influence the results so this is a map just using tweets with
state-level characteristics so it’s a map of some states having more negative
race-related sentiment at least according to tweets so you see darker
blue means more negative sentiment so kind of the South and part of the East
Coast and so why does that matter what’s interesting about characterizing race
attitudes and so one of our initial cuts is to look at birth outcomes because
pregnancy is a very defined period and birth outcomes can be influenced by
stress and so we’re finding that states with a very high negative sentiment for
instance have an increased risk of low birth weight very low birth weight and preterm birth
and what was surprising is that we see this in the full population as well as
when we’re restricting to minorities so initially we thought we were only gonna
see this for minorities because they are you know potentially
the target of race-related tweets that are about minorities but it seems to
be elevating adverse outcomes for the whole population and so when we’re looking
across subgroups we also looked among non-Hispanic whites and we also continue
to see adverse associations between more race-related hostility and more adverse
birth outcomes and this is using vital statistics so the
Twitter data is one thing but then to relate it to health outcomes
we need other data so the health outcomes come from the 2015 natality
file and it has characteristics like maternal age whether mom smoked
education status body mass index prenatal care so controlling for all of
that as well as state-level characteristics like state expenditures
on welfare racial composition median household income so controlling for that
we see you know a slight increase in adverse outcomes we’ve also looked at
CVD outcomes so places with more happy tweets seem to have lower hypertension
stroke and MI and conversely states with you know more sad tweets have increased
risk of these outcomes we derive other measures from Twitter for instance
happiness so just the overall sentiment of all tweets general tweets
and then tweets that mention physical activity tweets that mention food and
then the characteristics around food like caloric density whether it’s fast
food or healthy food and before going into this we had these characteristics but a
priori we didn’t know whether they would link to any health outcomes and controlling
for census characteristics including population density percent white median
age and median household income we’re seeing that areas that are
happier that mention physical activity that mention healthy foods are
related to fewer chronic conditions and we’re using data in which the
person has the zip code and so these are individual level data with Twitter
being the area level predictor and that’s kind of a snapshot of the kind of
work we’re doing

all right all right so you guys might not know
this but there’s this funny thing that a blender company did where they sort of
had this will it blend on various different devices and it felt quite
fitting to use for thinking about data blending so this is an iPhone that
somebody was blending in one of those commercials so what I’m gonna do
is present sort of a little bit more on a theoretical level here trying to
think about when it is that some of these techniques to try to bring various
forms of data together might work and when they might not right and so
if you want to start that activity and you’re like me sort of a survey
methodologist in my training and other stuff you start off by thinking about
where we are in our current paradigm right and our traditional paradigm
for understanding things about society and for understanding relations between
things in society is survey research right and that paradigm involves trying
to understand attitudes and behaviors by asking carefully constructed questions
of a sample of people that are carefully designed and selected to represent the
public as a whole right and so you know the trope is that they call you up
and annoy you during dinner to be able to successfully do
this but there are a variety of challenges that have plagued surveys in
particular in the last number of years in that response rates have been
declining it’s not unusual right now in a lot of reasonably high quality but non
federal surveys to have response rates as low as five or six percent the
costs are increasing even to keep the rates at around that and coverage
challenges are becoming increasingly problematic right you now need to figure
out how to combine landlines and cell phones you need to do a whole bunch of
things like that to make a reasonably representative survey and so it’s
increasingly difficult for those reasons to translate from respondents as a whole
to the population right you’re making a variety of assumptions when
you’re doing so and so this raises a question of under what conditions
might you be able to use other kinds of data right to either complement or
supplant traditional survey research and come to
many of the same inferences that we’ve had from those kinds of data and I’m
going to talk very briefly but I’ve done work sort of in three different areas
one is just using nonprobability surveys to do some of this work another
is looking at consumer file marketing data so the stuff you can purchase from
companies to try to find out about people and the third is using social
media data so I’ve done various different projects using all of these
and I’m happy to talk more about my thoughts on any of them um but if you
want to think about the different kinds of insights that you might get right on
an individual level at least right if you’re thinking about sort of adding to
what we know about individuals in surveys well there’s information that
you can get about the respondents from various kinds of supplemental data right
there’s information you could get about potential respondents so maybe about the
people you don’t reach in a survey that might be able to tell you about who else
was in your sampling frame or who you tried to reach and systematically what
individuals you were able or were not able to get right and you’re able to get
a bunch of additional data about the specific people who respond as well
right there’s also various kinds of insights
that you could expect to get on an aggregate level from these kinds of data
to differing degrees so you might be able to find out things about
populations if you have a good additional source of data even if you
don’t necessarily have a probability sample right so you might be able to
tell the distribution of some particular aspect in society you might be able to
think about sort of other forms of data that might let you test large-scale
social hypotheses or you might be able to actually measure changes over time
even if you can’t necessarily get exactly an accurate point estimate on
various things but the real concern here is that our ability to make inferences
from novel forms of data whatever they may be right depends on how well we
understand the sample that we’re getting and the process that’s actually
generating the data and this presents a series of theoretical and empirical
challenges that we have to get around if we want to use these data for for any
sort of substantive purposes in addition to traditional surveys right so if we
think about when alternative data sources are going to be valuable well
we need to think about for instance who’s providing the data and under what
circumstances so if you think about something like social media data those
data are being provided by people that are making a decision to actively post
about a particular subject at a particular time it’s a very different
process from having your attitude on a particular matter elicited through the
use of a targeted question to you right people might never happen to talk about
some issue that we care about or might only talk about it when it’s in the news
or other things like that which is going to lead to very different circumstances
when something’s going to come up the data themselves are often quite
different in nature that ranges from things like who they describe to what
exactly is happening there and there are a variety of practical ethical and
analytic differences that you need to deal with in how those data are being
used right it’s not ethically the same thing to be you know scraping data from
somebody’s cell phone as it is to be looking at data that people have given
explicit consent to providing right what I’ve found across a bunch of different
data sources that I’ve you know put next to surveys in terms
of doing this is that there are often a bunch of gaps so for instance with
consumer file marketing data when you purchase data from you know these
companies what you end up finding is there are some big issues with accuracy
the extent to which the data exists for certain individuals versus other
individuals is highly variable so there are huge missingness issues and the data
cover very unusual groups of people and so you can end up with circumstances
where you find yes you know a whole lot about people who own their home and
almost nothing about who actually rents their home even though those are
technically separate categories of the same variable which seems nonsensical
until you realize that home renters don’t fill out warranty forms with the same
frequency and thus their data is missing right um with regard to social media
data typically it’s common to find that you can create individual linkages with
social media data for about 40% of respondents Pew did much better in
their recent thing and I only just heard about this but a lot of people don’t
post on topics of interest so when I’ve been looking for you know posting about
you know political topics for instance you end up
finding that most people don’t end up talking about it and in fact in the
small set of data that I was able to link Democrats talked about Republicans
and Democrats more than Republicans talked about either one and so your
baseline initial assumptions about what you are measuring can often be wrong and
those individual posting tendencies vary by types of individual which can create
a whole bunch of problems right um if you think about nonprobability sample
survey data here you’re moving to the aggregate level right can you make the
same overarching inferences and what we’ve found across many studies is that
for distributions of variables you’re gonna be off by an enormous amount a lot
of the time relations between variables do better
but still aren’t perfect and there are huge differences when it comes to trends
over time right with social media data distributions of variables are a
disaster we don’t know how to calibrate social media metrics to tell you about
population parameters we have discovered some serious problems with coming up
with things like trends over time in ways that would mirror what we end up
getting off of surveys the only thing where I’ve found we seem to be getting a
decent grasp is if you want to identify something that you could sort of later
use a survey to track like trying to figure out what events are happening
right you can find that stuff and then plug it into a survey or something of
that like but that makes a lot more sense I’m and argue in this regard so my
argument then is that so far most of the techniques provide very different
answers to questions than surveys would core questions about data quality and
representativeness sort of need to be answered and often aren’t answered in
blending exercises and if we want to think about where there’s utility here
we need to really look at those narrow cases where we seem to be getting some
leverage so things like identifying events understanding relations between
variables when the data actually provides an ability to do so and
thinking about sort of when the ancillary data are high enough quality
that we can actually use them to supplement at the individual level thank
you

I was going to say Lisa said then the
next speaker is I think there was a little bit of measurement error there
not too much but a little bit I know I know but John is not the only one
talking okay so let me see if I can blow this up because I don’t usually use a Mac
so how do you slideshow play from start okay um good afternoon everyone
first I’d like to thank Lisa and her group for organizing a much-needed
meeting we’re all kind of like struggling with the same issue of how do
we blend data and a bunch of us are doing very similar things so I think
it is nice to come into one room and at least talk about some of the issues
that we’re all facing I’m very pleased to be here and I’ll specifically be
talking about some of the research that we have been doing as part
of our collaboration between the Massive Data Institute at Georgetown here and
Michigan so we have a group of researchers Josh is one of them and
there’s many of us here we are doing joint work trying to really
understand more about data blending and I think you know all of us you know
are barely scratching the surface we really don’t have much understanding yet
and we’re trying to you know get a little bit of insight into what that means
I’m gonna specifically zoom into one specific type of blending I’m gonna be
talking about micro data blending which is when you have a survey data
record and you’re trying to basically link it to a Twitter account by getting
consent from the respondent Josh has talked a little bit about it and he
mentioned the you know the 40% consent rate I’ll talk more about
you know what we know in general consent rates are and what predicts some
of these consent rates so I don’t really need to go through all of this slide
in detail because I think Josh has covered many of it but basically we all
know that social media data is available and I say available because some of it
is more available than others right and if you look at the recent reports
published by Pew the rates have been increasing over time though they
have been relatively stable in the past few years so since 2016 most of the
penetration of social media has been stable and you can see the rates here
and now though they are available and though some of them we can explore we
really don’t know much about their properties exactly what Josh said we
really don’t know much about you know I would say more specifically we don’t
know much about their measurement properties or the data generation process
we do know more about who they represent because of all the
research that you know Pew has been doing and others we know that they are
not representative of the U.S. population there are definitely
sociodemographic differences when comparing the U.S.
population to the Twitter population so when I think of data blending I think of it
as a tool to understand the data generation and the
properties before I jump into oh can we blend it or not you know I think of
what kind of things we can do to understand the properties of the data
first then we say oh okay good now that we understand it yes we can blend it and
I think what we will really figure out is like what we’ve seen in
many of our survey research projects it’s very contextual it might work for some
it might not work for others there are so many different variables and
situations where if it doesn’t replicate in one situation it doesn’t mean that it
doesn’t you know work for others but we definitely need a lot more research to
be able to figure out when it replicates if at all and if it doesn’t then that’s
not great so when I also think of data blending one thing that comes to
mind is also whether we can use one source of data to validate the
predictions from another source right so when we blend two data sources we all
know for example that with Twitter you don’t have a lot of sociodemographic
characteristics attached to it for most users whether it is geocoded data or other
things most of this information is missing so by being able to blend and
attach data to it we can at least try to see whether we could predict some of
those sociodemographic characteristics so that we can do more on group
differences by using only Twitter data and many people have done it you know we
are getting good accuracy relatively speaking on gender and on age if you’re
using large bins but we’re really not there yet in terms of
getting higher accuracy on other
predictions also when I think of it I think of you know trying at least to see
whether by combining multiple data sources you can gain more predictive
power to understand an outcome and there was a recent paper that has just been
published by Murphy and his colleagues about whether
by using Twitter and survey data you could predict sentiment
toward using e-cigarettes and they found you know that adding Twitter data above
and beyond the survey data improves their prediction of attitudes toward
e-cigarettes so this is an example that works that doesn’t mean it’s gonna
work everywhere but you know maybe one example that we could see some success
to some extent okay and then we can also think about whether you know we can
collect a different combination of data in different waves
especially if we have panel data right so you could think maybe I
collect survey and
Twitter data or social media data in wave one and wave three and in between
maybe I can reduce the cost and only do one of these sources of data with the
aim of maybe using one source to impute or know a little bit in between
waves about respondents assuming there is some correlation between the data
which still needs to be figured out but for us to be able to do many
of those things you know one type of linkage is the micro-level linkage
can I get consent to link and you know some people say why do we need consent to link if the
data is available to the public there’s many reasons why we need consent you
know one of them is many users don’t read the Terms of Service and we are
basically predicting information about
the user above and beyond what they are explicitly sharing I you know I
mean I can predict their gender even if they don’t share their gender and so on
so forth ideally because I have two minutes left I want to go into the meat
of it and I want to talk about what we found in terms of what predicts
consent right so maybe let me talk a little bit about do we have any existing
framework of consent for social media we don’t
there hasn’t been any comprehensive framework that says what
affects consent rates what are predictors of consent we have a lot
more on administrative linkage and some of the issues apply to social media and some
of them don’t apply you know including for example the survey design features
there are new papers published by people from Britain that say well
mode definitely affects consent right like if you try to get consent to
link to Twitter data face to face you get that 40% rate that Josh
mentioned if you’re doing it on the web you’re getting much lower anywhere
between twenty to thirty percent the panel that was mentioned the
GfK panel the one that was just published by Pew it is on panelists they got a 90
percent consent rate but it’s on a very different population we just
have to keep in mind the different populations that we’re talking about
okay let me jump to two case studies that we’ve done quickly the first case
study is a web survey that we conducted in collaboration with researchers from
Belgium it’s a web survey among college students on mental health we
asked them if we could get their Twitter handles and we got what we usually see
a 24% consent rate when you are asking on the web right so this survey wasn’t
designed to link Twitter data it was designed for another purpose they added
those questions for us but what we were able to do is we were able to add
questions on the frequency of use so what we found is that the more social
network sites you use and the more frequently you use Twitter you are more
likely to consent we think it’s the relevance issue which we’ve seen later
if it’s relevant to you you’re gonna consent more it didn’t seem that the
type of use of Twitter matters whether you read or retweet or just you know
don’t report any use it’s not related to consent we also find something
interesting here which is the more you report symptoms of alcohol use the more
likely you are to consent now I don’t think it’s related to alcohol of course it’s not
because you’re drunk that you’re consenting but it’s mainly because both of them are
correlated with a third variable probably you know that you just
report sensitive information you don’t really worry much about privacy when
you’re reporting so there is a third construct there that is you know
affecting that relationship the second case study is
from a national address-based sample of the US it’s a mail-to-web survey
we mailed the link and then
the respondent went to the URL and did the survey
again we found the same thing about 27% consented in this one we didn’t have
a frequency of use measure it was basically a survey also on health but
it has a bunch of other factors it has a scale that looked
at helping behavior whether respondents are more likely to engage in a helping
behavior such as I gave money to a stranger or I helped somebody find a
job and we found that there is a marginal correlation between helping
behavior and consent which has been reported in the literature with other
types of linkages too we found that there is some association with religious
identification if you identify as Jewish you’re less likely to consent if
you are more spiritual you’re more likely to consent we still don’t know
the mechanism it could be you know that you trust people
more we really have not yet dug deep into the mechanism of those effects and
if you identify as Hispanic you’re more likely to consent and then what we
wanted to look at other than all of those sociodemographic
characteristics and frequency of use that predict consent that I
showed you we wanted to look at whether the consenters and non-consenters differ on
health outcomes right so the two surveys are
heavily asking about health outcomes and basically what this graph is showing you
these are the health outcomes that we got from the
address-based sample these are from the college survey and what it really shows you
is consenters and non-consenters seem to be the same on all of those health measures
which is a good thing to some extent right if you are interested in
basically focusing on consenters right now but the big question remains
what did we find from their Twitter data were any of those measures
associated we have not looked at this yet we just got IRB approval to get
the handles that were collected because we are you know we’re using data that
somebody else collected so the next step for us is basically to look at what we
glean from the tweets and to see whether there are any
associations between some of those measures we don’t know I mean Josh
and we all have our skepticism are we ever gonna capture something on a smaller
sample size that we are able to link to those specific outcomes but we’re also
gonna look at a bunch of other metadata and try to understand whether some of
this metadata might be associated with some of those outcomes and I’m
beyond my time I’m very sorry happy to talk about any of those aspects in the
discussion thank you thank you for having me here
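
An editorial sketch of the micro-level linkage described in the talk above: tweets are attached only to survey respondents who consented and supplied a handle, and consenters are then compared with non-consenters on a survey measure. All of it is illustrative. The records, the field names (`consented`, `handle`, `health_score`), and the helper functions are hypothetical and are not taken from the studies discussed.

```python
from statistics import mean

# Hypothetical survey records (toy data, not from either case study).
survey = [
    {"id": 1, "consented": True,  "handle": "@user_a", "health_score": 3.0},
    {"id": 2, "consented": False, "handle": None,      "health_score": 2.0},
    {"id": 3, "consented": True,  "handle": "@user_c", "health_score": 4.0},
    {"id": 4, "consented": False, "handle": None,      "health_score": 4.0},
    {"id": 5, "consented": False, "handle": None,      "health_score": 3.0},
]

# Hypothetical tweets keyed by handle; "@user_x" is not a survey respondent.
tweets = {
    "@user_a": ["tweet 1", "tweet 2"],
    "@user_c": ["tweet 3"],
    "@user_x": ["never linked"],
}


def consent_rate(records):
    """Share of respondents who agreed to have their account linked."""
    return sum(r["consented"] for r in records) / len(records)


def link(records, tweets_by_handle):
    """Attach tweets only for respondents who consented and gave a known handle."""
    return [
        {**r, "tweets": tweets_by_handle[r["handle"]]}
        for r in records
        if r["consented"] and r["handle"] in tweets_by_handle
    ]


def mean_gap(records, field):
    """Mean of `field` among consenters minus the mean among non-consenters."""
    yes = [r[field] for r in records if r["consented"]]
    no = [r[field] for r in records if not r["consented"]]
    return mean(yes) - mean(no)


linked = link(survey, tweets)
print(consent_rate(survey), [r["id"] for r in linked], mean_gap(survey, "health_score"))
```

A gap near zero on the survey measures is what the talk reports for its health outcomes. A real analysis would add a significance test and survey weights, which this sketch leaves out.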
this thing is okay you don’t just powerpoint where I come from we have this thing called a browser
which is somewhere here nope there we go so I’m going to talk about this artifact
that we’re building called Data Commons I work at Google so just to
set some truisms for context data powers everything today science journalism
health studies everything you’re the last audience I need to make that point
to there’s a ton of data DataMed Dataverse and the big ones Census
BLS CDC FBI NOAA hundreds of thousands of files CSVs different schemas and for
using the data the current model is a student researcher goes in
forages for the data tracks down the assumptions pulls it down cleans it up
you know compiles the sources figures out where to store it high upfront costs
the really interesting things show up when you start joining different data
sources and that is really expensive they do this they write their paper the
data goes away the next person starts from scratch the goal is to build what
we think of as Google for data Google is great because it allows us to pretend
that the entire web is on a single site what would Google for data mean it’s for
useful programs if you could enable a developer to pretend that enough not all
but enough of the data that pertains to their domain is in a
single database single schema single API that would be good the data model that
we like is something we call a knowledge graph it’s an ancient data model from
the world of knowledge representation in AI it basically says that
the world is made of a set of entities with relations between them and these
entities have attributes this is a simple fragment of a knowledge graph
it’s about this musician Tori Amos it says that she’s a musician she was born
in the city Newton which is a city in North Carolina and so on
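
The fragment just described can be sketched as a small set of subject, predicate, object triples with a single function for navigating them. This is only an illustration under stated assumptions: the identifiers (ToriAmos, bornIn, containedIn and so on) and the KnowledgeGraph class are made up for this sketch and are not Data Commons names or its API.

```python
from collections import defaultdict

# Hypothetical triples for the fragment described in the talk.
# The identifiers are invented for this sketch.
TRIPLES = [
    ("ToriAmos", "typeOf", "Musician"),
    ("ToriAmos", "bornIn", "Newton"),
    ("Newton", "typeOf", "City"),
    ("Newton", "containedIn", "NorthCarolina"),
    ("NorthCarolina", "typeOf", "State"),
]


class KnowledgeGraph:
    """Entities are nodes; each (subject, predicate, object) triple is an arc."""

    def __init__(self, triples):
        self._out = defaultdict(list)
        for subject, predicate, obj in triples:
            self._out[subject].append((predicate, obj))

    def arcs(self, entity):
        """Return every outgoing (predicate, object) pair for one entity."""
        return list(self._out.get(entity, []))

    def follow(self, entity, *predicates):
        """Walk a chain of predicates starting from one entity."""
        frontier = [entity]
        for predicate in predicates:
            frontier = [obj
                        for node in frontier
                        for pred, obj in self._out.get(node, [])
                        if pred == predicate]
        return frontier


graph = KnowledgeGraph(TRIPLES)
# Where is the place this musician was born located?
print(graph.follow("ToriAmos", "bornIn", "containedIn"))
```

The point of the single follow call is that the caller does not need to know which source each arc came from.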
this is an unremarkable fragment of a knowledge
graph except for the fact that different parts of it come from different sources
now imagine having to write a piece of code that goes to these different
sources one by one and puts this graph together for you
versus you just have a single API to navigate the graph the long-term vision
for Data Commons is imagine if we could pull together all these data sources
everything from Census to BLS FBI NOAA CDC Landsat Wikidata EPA a whole
bunch more a growing bunch more that anybody could contribute to and you
integrate this into a giant giant knowledge graph of course this thing
becomes too big to download so you get web APIs and we
believe that if such a thing were to exist it would be really really useful
for starters in many cases it avoids repeated data
wrangling you avoid the burden of data storage indexing etc the biggest benefit
is from joining and the more you know about two references to an entity the
easier it becomes to join if such a system were to exist we think lots of
people will use it the problem is jump-starting this ecosystem and this is
exactly the kind of thing that Google is good at so we’ve built a version 0.1
we’ve integrated large chunks of ACS BLS FBI CDC NOAA and a whole bunch more
stuff the Python the core engine is open source because we often get told by the
academic community that you guys do something then you lose interest and
then you move on so we decided that what we would do is take the core engine and
make it open source and we’re working with both Stanford and Berkeley
to create instances of this that they are running there’s two applications
that I will briefly show you to give you the idea of what it is but before that
let me actually show you you should be able to go to can you see that is it too
small let me make it a little bit bigger okay
browser.datacommons.org everything I’m showing you today is
public and open I come from this amazing beautiful city called Mountain
View in California this is a simple browser it shows
you a node at a time and all the arcs out of it a bunch of details you see we
have USGS Census and let me actually show you the romeo census FBI BLS CDC
and so on a nicer view is a timeline view this is population of
Mountain View by gender drill down this is women in Mountain View drill down
further this is Asian women in Mountain View and so on and so forth so we have
all this data what is it useful for the first thing that we decided to do is to
actually use it for students according to edX over the next few years
there’s going to be something like you know two and a half to three million
students who are going to take data science courses and in many of these data science
courses they learn these methods and they then apply their tools on
toy datasets here was a study that we did where the students built a model of
obesity across 500 US cities as a function of blood pressure unemployment
and people below the poverty level this would be a multi-day or a multi-week
exercise just gathering all the data to cut a long story short this is BP versus
obesity this is unemployment versus obesity this is poverty versus obesity
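
A minimal sketch of the kind of model the students built, obesity as a linear function of blood pressure, unemployment and poverty across 500 cities. Everything numeric here is synthetic and invented for illustration, the coefficients are made up, and fit_ols is a hypothetical helper written for this sketch rather than anything from the course or from Data Commons.

```python
import random


def fit_ols(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of predictor lists; an intercept column is added here.
    Returns [intercept, coef_1, ..., coef_k].
    """
    design = [[1.0] + list(row) for row in rows]
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           + [sum(r[i] * y for r, y in zip(design, targets))]
           for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(k):
            if row != col:
                factor = aug[row][col] / aug[col][col]
                aug[row] = [a - factor * b
                            for a, b in zip(aug[row], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Synthetic city-level rates in percent, invented for illustration only.
rng = random.Random(0)
cities = [[rng.uniform(20, 40),   # high blood pressure
           rng.uniform(2, 12),    # unemployment
           rng.uniform(5, 30)]    # people below the poverty level
          for _ in range(500)]
true_coefs = [5.0, 0.6, 0.3, 0.2]  # made-up relationship, no noise added
obesity = [true_coefs[0] + sum(c * x for c, x in zip(true_coefs[1:], city))
           for city in cities]

coefs = fit_ols(cities, obesity)
print([round(c, 3) for c in coefs])
```

With real data the hard part is the gathering and joining step the speaker mentions, not the fit itself.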
and then they go on to build complicated models and the good and the interesting
thing from our perspective is that we had 400 students do this in Berkeley it
was the upper division data science course in under an hour for a grade and
so that if you bring we believe to get another application that we’re doing is
for data journalism wouldn’t it be great if more journalism was actually based on
fact some people say that the crime rate in border cities is going up there’s a
bunch of variables about El Paso the thing that we are interested
in is the violence it doesn’t seem to be going up maybe a different city because
you know no I’m sure this is crazy Californians no let’s do this per
capita mmm certainly not a journalist asks can I get this
chunk of data as you know JavaScript code like you can embed maps sure you
want the raw data sure you want to build your own interface like this for a
different kind of a thing sure you want to go you know navigating around for
interesting stories somebody discovered this thing which is another beautiful
city that we know these cities are really boring but they are there this is
next to where we are we are not crime but if you look at
there’s literally thousands of variables here by the way they can you know but
the thing which we’re interested in right now is nativity a very interesting thing
happening in these Silicon Valley places they crossed over the number of native
born is now less and you’d say wow that’s interesting maybe New York is
like that right melting pot and all that stuff no New York is not like that
and so what are the variables that correspond to this this is a story and
what we’re doing is we’re giving this thing away so that every journalist can
use it and the only condition is that they put the citation for where
the data comes from which is leading to other people like we’re working with Raj
Chetty’s group at Harvard they’re putting their data in Sean Reardon is putting
his data in which means that you get the positive the ecosystem going and you put
your data in you get APIs everybody gets APIs for using this data it’s not
just in this domain we’re also building this kind of the core critical mass
nugget in in epigenetics in energy data and a few other verticals so we’d love
to collaborate with you thank you okay great that was wonderful
those are four great presentations can I start by throwing out just one or
two questions to the panel and then we’ll take questions from the audience
I’m (unintelligible) okay well given the time we have I’m gonna take one question I’m
gonna throw one question then throw it out to the audience the audience has
some time for questions so my question is could you talk a little bit about if
there are challenges with consent procedures when combining different data
sets I know that’s different for industry and academia but the
ethics of consent for people to use their data when they come from different
sources and I know that you know we have all the same kind of ethical obligations
but academic research has different procedures for consent than industry
and than the government does and is it a challenge when you need to tell people
their data is gonna be merged with some other data they don’t realize it can be
merged with is there a danger which is I’m always worried about that they’ll be
completely spooked by this by this notion that some some dataset they don’t
recall even being in you remind them and then you tell them actually this survey
response or this other data set is actually gonna be combined with this
other thing and why is that not scary for people or is it that this question
does not actually come up very much so happy for anyone who wants to
jump in on this we’re about to launch another survey where we ask for
consent and we I just got an email from the IRB about all the modifications I
need to make to the consent so it’s definitely there’s a lot of challenges
that are I’m gonna specifically talk about social media of course it’s gonna
depend on the source but if I if I talk about one case study which is the
Twitter I think the fact that you are connecting to data that is already
available right so when you get consent
you have previous information about the respondent and then you can keep
collecting information for as much as you want right so the consent procedures
have to be at least what we are being required to to do is to be very specific
about you know what is the data that we’re collecting and the duration how
long we’re gonna keep collecting data from them so it’s not like a survey you
go and you ask the question you collect it and you’re done
and if you want to go back again you know you your consent has to say that
would you be willing to do the survey again if they are panel right but with
social media data you can continuously keep collecting data on them and you
have to have a duration so can you tell them I’m gonna keep tracking you for 10
years 20 years till you die right so you have to have a time frame so there’s the
issue of you know that the data is you know ongoing and that there’s historical
data that once you access it you know your consent has to specify that if you
are gonna add there’s the issue of followers the followers and following
right so you have a pair you know which we knew about if you look at the
literature about the use of network information about the network it’s
useful you can predict some information by getting information about the network
and it’s a very powerful thing whether in terms of predicting demographics or
other kind of construct variables and you need to tell them you need to tell
them that I am gonna collect information about the followers but does that mean
they have to go get consent from the followers we still don’t
know I mean recently we’re dealing with all of those issues
you know the IRB is making us be very specific about what kind of construct
certain constructed variables we’re gonna do are we adding this are we using
this to predict their gender or race right so I think there’s a lot of
challenges to you know understand what is the right type of consent where we
are informing respondents but as you said not freaking them out right so you’re
seeing consent rates without being this I mean the consent rates I’ve showed you
those were not very specific consent language I can show them to you they’re
more or less you know generic more or less with some being more explicit
than others and we still get 24% consent or 27 or 40 so what I don’t know is what
would happen to the consent if we’re gonna add a lot more to that which I
think we should but where is that fine line I don’t know the
answer to but there is I think specific challenges to consent that at least
we’re dealing with I would just note this is just all rapidly changing right
so when I started running studies in this domain what eight years ago or
something the IRBs viewed all of this stuff as completely exempt and all you
had to do was ask people in the context of the survey and that itself was viewed
as exempt right and so this is changing from a regulatory level very quickly
it’s changing from a public perception level in that people seem to be a
little less willing under certain circumstances to give consent because
there’s more of a recognition of privacy it varies hugely depending on what
country you’re doing things in and so how you think about this and you know
especially if you’re dealing with any of the European regulations right it’s
potentially a fairly different game right how you think about this is is
very context specific to an extent that we can’t even assume it’s going to be
the same four or five years from now issues of privacy and not freaking
people out especially for like race related terms so we collect over long
periods of time so for all the papers that we release we generally collect
over a year’s worth of data and we summarize for instance for like a zip
code or County or a tract so that any one person’s tweet is going to be buried
among lots of people for a year and then we connect that with individual health
outcomes from other surveys so in a way we’re trying to make it so that people
are not identifiable that it’s more we’re gonna capture the social
environment and not one person is gonna be identified and when we publish we
don’t publish single tweets we kind of describe them in aggregate so that’s
how we kind of pay attention to issues of privacy with the caveat that I have
never actually conducted a study of the sort you guys have when we do things we
run experiments on 0.1% of our users which is significant wearing a
completely different hat the law and computer science are on a collision
course for the first 30 40 years of computer science law plus computer
science meant intellectual property now you have a situation where exactly what
does it mean for India to say this is what our national
borders are and what about other rules like what does the boundary
mean what does differential privacy mean in terms of law what’s a valid value
for epsilon how do you even explain this to a jury there are so many of these
issues which are going to be you know just like the First Amendment has
its limits you can’t go into an airport and shout you know some things are there
similar limitations Raj Reddy has this incredible talk where he says
at what societal cost privacy if you could eliminate human trafficking at
some privacy cost that is a decision who decides that is it the individual to decide
that is it the government to decide that is it society who decides that these are
incredible base question capabilities that we have as a society now because of
this that have to be regulated modulated I don’t know by the same mechanisms with
which today’s laws were created and I don’t think that’s going to happen
from the political process in our current environment it has to come from
the academic environments and so I’m a computer scientist by training so
I’m looking to you guys so that’s a good opportunity to ask for questions from
the audience yes the plan is to do that later this
year we decided rather than go shallow and broad we would go very deep in one
area and you know frankly in terms of open data the US is about
two orders of magnitude ok maybe 100 times better than other countries and
when you have some tool like that I’d much rather have we had the foreign
secretary equivalent of the country formerly known as the United Kingdom visiting us a few weeks ago and
they’re like how come our data’s not there and the idea is to basically say
here make it available in this schema in this format and then get everything
together just on the first question we only started to look for comparison
between geotagged and non-geotagged tweets at least with the race variable so with the
race variable we have some tweets having place characteristics so it has like
state-level and more people captured in that and
we’re actually finding robust relationships with adverse birth
outcomes and that seems to hold in the non-geotagged as well as geotagged we were
surprised because like you saw the state level patterns were very different but
yet it’s still lining up and we think that overall our work with Twitter we’re
thinking we look at it like Twitter is an imperfect data set it does not
represent everybody adding geotags is just adding more selection bias to it
but at the same time it gives us some information that we’re finding
relevant to predicting health outcomes so even though the information
is selected and not representative it’s still providing you enough that it
can track with certain health outcomes so the question was about the Hispanic
effect right and I think there is some like we I haven’t we haven’t looked at
it explicitly keep in mind that the consent this study was not designed to
look at predictors of consent right so we so there could be different level of
privacy concerns among this age group among this ethnic group
there could be other issues related to trust that they are more
trusting they could be more socially active in terms of social media
this one did not control for frequency of social media use so we’re hoping that
in the next survey that we are gonna do where we are designing it to explore
more of those effects we might if it replicates we don’t know if it will
replicate if it replicates then we might be able to tease out some of those
effects in the coming one but honestly with this one we are very limited I mean
we could think of potential reasons but we are very limited in terms of what
variables we can add in to see whether that effect goes away or not do you have
any more questions the Google thing is that mostly
publicly available datasets that you are using or are there some
proprietary datasets or is it possible that this framework could be
used by a collection of proprietary entities so the data
that is publicly available is all public data the architecture is such that it
allows the model is remember we used to have these transparencies before all
these fancy things but you could overlay one on top of the other and if you align
it just right you get the right effect that’s exactly the idea so it’s like
your intranet versus the external whole web only the people inside the company
can see their intranet and so let me give you one simple example
there’s a university in Montana we were doing a pilot with and I didn’t
know this but it turns out that the suicide rate in Montana is very high and
they’re trying to figure out what on earth could this correlate to now
they don’t want to release their data that they have to everybody before they
publish it so they are overlaying this on top of all of this other data
Stanford is building a Data Commons instance where Mark Cullen and some
of his folks are building a layer about Stanford proprietary data which is
accessible only to a very limited number of people who have been approved
those names are in the IRB and things like that so there can be many
instances of this it’s not just one okay I don’t want to shortchange the
second panel we will do one more question and then we’ll move on to the
second panel thank you and I’ll keep this short also a question for
Ram here so in terms of the browser Data Commons that you showed
is it HIPAA compliant you mentioned a couple of health-related verticals
right so you said epigenomics and others yeah so for those of us who are building
data commons with health data what we use in the epigenomics data is
basically ENCODE and so importing the ENCODE project with some people from the
ENCODE project and putting it in so for the health data given that we
are doing this as Google we have to be 10,000 percent sure that there is no
privacy violation even remotely there so everything that is public comes from and
is run on the Data Commons by us nobody can upload it you
have to put it out there publicly and then we will suck it up
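
The layering described in this answer, a public base layer with institution-controlled layers overlaid on top like transparencies, can be sketched with Python's collections.ChainMap. This is an assumption-laden illustration, not how Data Commons is implemented: the keys, the values, the user names and the view_for helper are all hypothetical.

```python
from collections import ChainMap

# Public base layer: placeholder values, not real statistics.
base_layer = {
    ("Montana", "population"): 1_000_000,
    ("Montana", "unemployment_rate"): 4.0,
}

# Private layer held by one institution; it is never merged into the base.
private_layer = {
    ("Montana", "unpublished_study_metric"): 0.42,
}


def view_for(user, approved_users):
    """Approved users see the private overlay on top of the base layer;
    everyone else sees only the public base layer."""
    if user in approved_users:
        return ChainMap(private_layer, base_layer)
    return ChainMap(base_layer)


approved = {"alice"}  # hypothetical approved list
insider = view_for("alice", approved)
outsider = view_for("bob", approved)

key = ("Montana", "unpublished_study_metric")
print(key in insider)
print(key in outsider)
print(insider[("Montana", "population")] == outsider[("Montana", "population")])
```

Lookups fall through from the overlay to the base, so both kinds of user query the same shared data while only approved users can see the private layer.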
so the ENCODE project is all public and so it comes in and the stuff
that Mark Cullen and COBOL run etc they are running it it’s theirs
all our data is at the base layer that comes into them but then the
layer is their own layer they control it and you know they
run this and it’s their responsibility well please join me in thanking our
wonderful first panel here today in the center we have Jeff
Chen he is currently the chief information officer at the Bureau of
Economic Analysis prior to that he was the Department of Commerce’s first chief
data scientist and served several other roles in the last administration and
closest to me we have Stephanie Lee Studds who is the division chief at the
Census Bureau for the Economic Indicators Division she is currently
responsible for twelve of the thirteen principal economic indicators for the
Census Bureau she’s been with the bureau for twenty years and has a background in
survey work business architecture and now twelve of the thirteen economic
indicators so this is our group who’s going to be discussing how data blending
is occurring or should occur in the federal government great so so this is
better yeah welcome this this group is really quite distinct from the prior
panel I think on two dimensions one we want to drill into what’s going on in
the federal statistical system mainly BEA the Bureau of Economic Analysis
and Census but two we’ll be emphasizing economic statistics much
more than demographic statistics so there’s a big contrast going on here the
wonderful thing about the characters in this room is that we can be as geeky as
we want to be I know you guys speak to to wide audiences but but you can be as
geeked out as you want so welcome on that and probably for this audience the
the most important thing is to give us examples of real things that are going
on and I know both of you are dealing with with different private sector data
sources that are non-traditional to your agency so Jeff maybe you want to start I want to start off by talking
about blending in the prediction context focusing on y hat so we want
to produce some prediction of some target series based on
borrowing signal from a blended set of data so the GDP
number comes out as an advance estimate and is then followed by a second and third estimate
that incorporate more information but sometimes data that
we need doesn’t get to us until the second estimate so that means the
first number that comes out for GDP may miss some core signals so in order
to pull that number through we have to use a forecast so the way we’ve
been thinking of data blending in the prediction context is to look at
different sources and if possible to patch together a record of the signal
something that would mimic the underlying trend in the economy
and one area that we really need help on is the services sector so about
46 percent of GDP is the services sector numbers and we basically underwent this
machine learning algorithm development initiative to predict the services
sector numbers and we do this by mashing together monthly level data
from unemployment and inflation indicators as well as credit card
transactions in mixed frequencies and mixes of geography and the credit
card transaction data is actually pretty key because that’s a data set that was
made available to us through our partnership with the Census Bureau and then
there’s also what kind of prediction project would there be without
web traffic data from Google that’s on a weekly basis so together you’re looking
at about a thousand indicators using machine learning techniques such as random forests
and neural networks to produce an ensemble prediction and what we find
is we can actually shave off one-third of the
revision on our predictions on GDP in key areas so the health sector is
about 1/8 of you know we can knock off about 30 33 percent of the revision just
through prediction and using this data blended approach there are some
shortcomings to it so it’s a bit of a departure from the social science
traditions where you have a structural model that you want to estimate and tell
the story but since we’re focused on producing a prediction that will
mimic economic activity it’s actually pretty good
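
One step in the blending described above is lining up indicators that arrive at different frequencies, monthly and weekly, so that they can feed a quarterly target. A minimal sketch of that alignment step follows. It assumes simple within-quarter averaging, which is a choice made for this illustration and not necessarily what BEA does; the series names and numbers are invented, and the model step (random forests, neural networks) is left out.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def quarter_of(day):
    """Map a date to a (year, quarter) key."""
    return (day.year, (day.month - 1) // 3 + 1)


def to_quarterly(observations):
    """Average dated observations of any frequency within each quarter."""
    buckets = defaultdict(list)
    for day, value in observations:
        buckets[quarter_of(day)].append(value)
    return {q: mean(values) for q, values in buckets.items()}


def blend(indicators):
    """Join quarterly-averaged indicators into one feature row per quarter,
    keeping only quarters where every indicator is present."""
    quarterly = {name: to_quarterly(obs) for name, obs in indicators.items()}
    shared = set.intersection(*(set(series) for series in quarterly.values()))
    return {q: {name: quarterly[name][q] for name in quarterly}
            for q in sorted(shared)}


# Invented numbers for illustration; series names are placeholders.
indicators = {
    "unemployment_monthly": [(date(2018, 1, 1), 4.1), (date(2018, 2, 1), 4.1),
                             (date(2018, 3, 1), 4.0), (date(2018, 4, 1), 3.9)],
    "card_spend_weekly": [(date(2018, 1, 7), 100.0),
                          (date(2018, 1, 14), 102.0),
                          (date(2018, 4, 8), 105.0)],
}
features = blend(indicators)
for quarter, row in features.items():
    print(quarter, row)
```

Each resulting row is one observation that a prediction model could be trained on against the later GDP estimates.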
so you’re predicting at the national level at the moment of the release of
the preliminary estimate and you’re actually predicting the second the first
revision the third revision okay and and so and that’s at the national level but
these data allow you to do other things as well oh go ahead why don’t we switch to
Stephanie for similar data so as Jeff had mentioned we at Census came
together with BEA to share some of these same datasets our business tactic
was slightly different than what BEA was trying to accomplish
I couldn’t have spoken better than some of the people in the first panel did
where they talked about survey respondents the fatigue the increase in
cost the reduction of response rates things of that nature so Census has
really started an early engagement in really looking at these third-party data
sets or alternative data sources to actually replace missing respondent data
and it’s taken us a while to get there so one of the core uses that
we had as Jeff was talking about was we were able to use some of that data in
seasonal adjustment in ARIMA-SEATS for some of our daily things such as
Super Bowl Sunday Cyber Monday things of that nature it didn’t work for us
necessarily in the initial engagement to do some of the other
things we really wanted to do which was replace missing respondent data and why
is that so incredibly important to us one we want to make sure that we can
continue to provide really good statistics that gold standard that the
Census Bureau is so well-known for but we’re also getting a lot of questions
and a lot of requests from our data user community to produce more timely even
faster than the monthly or the quarterly indicators that we put out they want
more product level detail so in some of our engagements what we’ve been able to
find is how we can begin to produce additional products for our data
stakeholders without actually having to add burden so that’s been one of
the I would say true advantages to some of the relationships we’ve built with
some of our groups I would say transparency it’s critically important
in this so and I can talk more about that later but the transparency of the
vendors and the groups of the alternative data sources has been
critical we want to understand as I heard many of you say today where is the
data originating from what is the original where did it come from what has
been added to the data really any methodology that’s been used behind it
so on a number of our groups that we’re working with the true success has been
our partnerships with them truly understanding the data them
understanding our business need and then then being able to actually share how
that data is delivered to them and what’s been done to it the process
without that I would say we’ve had some areas that haven’t gone so well with
these alternative data Jeff why don’t you make some similar
sort of comments about how you know what you’re getting when you have the data
what have you done at BEA we focused on figuring out if there are
substantial biases in estimates in particular one of the key
things with any of these alternative sources is to tease out the signal of
that company from the economy’s signal so there’s a lot of pre-processing and
feature engineering that has to happen in order to weed out these odd signals
so in the case of the credit card transactions there are
large spikes in the raw series because these companies grow and contract
over time and so in order to level out the signal we have to use some
traditional survey techniques to have a good constant change sample but also
incorporate other model-based approaches to smooth out the signal and
that’s a common theme across all our work like we have another data set that
we’re looking at on anonymous ride share so what are the data that you’re getting
from there so in order to bring this together we have full population
transactions for like US cities for taxis and then for some cities we
actually get the total count of ride share data but then we don’t have prices so
the prices we get from an email panel so some of these companies
they offer services to the public and they get their email receipts and
anonymize them and turn them into something that is magical so essentially
we have roughly a billion transactions across
three US cities in three years we have a hundred and nine thousand rides that
are taken and the time frequencies are all mixed and they have to be matched
in a way that would be usable and so one way that we can make this useful is
by bringing our sampling design hats on and developing
actually so I assume the first time you looked at these data sets you spend an
inordinate amount of time in this pre-processing does it get better
absolutely I mean at some point we had some code that was scalable to
multiple cities in four weeks and this was only two people you’re just
bragging yeah so I would say we’ve had a much different approach as I said
earlier we’re actually looking to improve our non-response and reduce
burden for our companies so we’ve worked out negotiated for that in order for us
to be able to do that replacement of data we’ve done an awful lot of research
with our 2012 economic census with our annual surveys with our monthly
indicator surveys to say does the data in fact look very much like what the
company already gave us and in many instances we have found that it does
match very very well so there’s a there’s a complexity there that we feel
comfortable in using that data but in doing so we found issues with their data
sets as found or we have found some issues with
our datasets as they gave us that data so I would say it’s been a much longer
process for us I would say we’re still in the early stages we’ve been at it for
a couple of years and anytime we embark on taking data that didn’t come directly
from our survey respondents we’re gonna take that time to very clearly make sure
that it emulates what a survey respondent would would have given us and
in many instances we’re still working with them as we’re getting these data
feeds so they’re very aware we’re getting them they’re part of the process
and we still do maintain those relationships and have discussions with
those organizations so reflecting back on the data holders when you were first
dealing with them and I’m sure we’d all be interested in what sort of cultural
barriers did you did you encounter with regard to the rigor with which I know
your agencies deal with data and how they viewed data so what are the what
are the memorable stories that you have in your head with those kind of early
conversations I think I’ll just talk more generally there are two distinct
classes of data vendors actually three distinct classes there’s one group that
is in it to sell the data pure sales there’s a really small team that’s
working fast and loose and they just want to get that sale out the door
there’s another group that is a data aggregator actually has a measurement
science team so they have these like really smart survey statisticians and
physicists on the same team trying to figure out what the sampling weights
should be to give you a full package and then there’s this other group which
is typically more on the technology side tech companies that are trying to throw
their hat into the ring and get you to see you some their data maybe it’s to
tell part of the story of the economy that is not really being told right now
and each of those groups have very different reactions first group
obviously they want to get you a pilot they
to throw high price tag on and they will give you a a great business development
engineering team and as soon as you sign a contract you swap them out so about
the 18 for the B team then the next team is the same team throughout the process
and the group that we purchased the email receipt data from
they were just phenomenal they have a great dedicated team and they didn’t
even invoice us until we were completely confident the data was good and
then the last group it’s tough because the tech companies some of them
require the CEO of a big tech company to sign off on release of information and
then others they just want press they want to have publicity and it’s very
stop and go it’s really on their terms the pace is on
their terms not ours and on the data quality in the last group
oftentimes they’ll go and get contracts with the biggest and best economists and
econometricians publishing in big big journals to make the point that the data
is good so it’s a very broad range of different stories how about you Stephanie
so I would say the first group that Jeff was talking about is probably our joint
relationship that we had and I would say very much there was a lot of concern
there because not a lot of transparency into how they produce their data at
least for what we needed which was different than the solution that Jeff
was looking for so we very quickly after numerous conversations said you know we
just don’t see the value to us here it was okay the data we’ve looked at it it
just didn’t give us the coverage there were things missing that would have
satisfied us and brought us that gold standard and of course they still want
to continue to bring in the A team to sell you and then same thing we got the
B team the other organization that we’re working with has hired data scientists
actually to work with us so they’re an aggregator who has gotten data there I
wouldn’t say they’re a huge company they’re probably a medium-sized
organization who is working very closely with us like they’re very
transparent in how they get the data they’ve worked very closely with us we
gave them special sworn status at the Census Bureau so we could talk companies
we have a I would say three-way relationship so between our companies
with them with us and that’s truly proven to work very very well the
transparency the back-and-forth with the companies it’s giving us that good
feeling that the data we’re getting and how we’re using it is headed in the
right direction we’re also working on our construction indicator programs as
well working on building permit data and you know a lot of the building permit
data is publicly available however there’s so much characteristic data
behind that that some of these vendors have it’s been a unique opportunity to
work with them because I don’t feel as much the sales protocol but really truly
again trying to understand the business need of the Census Bureau and how they
can help us and what we found there as much as we found in some of the
others is that it would take multiple vendors to actually service us to get all of
those pieces and we’ve worked real close with Jeff’s team on Zillow data as well
working with them in the same respect so we’ve had a lot of good opportunities
some want to just give us data and let us work with it and work with us and
like I said hire data scientists to really do that and then you have to be
wary of I would say the ones who swap the A and B team and who aren’t very
transparent in how they’re doing business and it’s more of a sale to them and to
say that they’re working with statistical agencies in the government
sector so when I know when the Federal Statistical agencies first started
talking about this there were there were some concerns about losing control in
the kind of way that Josh was talking about how surveys are under design
control but but you’re giving you’re losing that control when you move to
these vendors and one of the fears was that there would be breaks in time
series that were actually undiagnosed by the firm producing the data have you yet
encountered sort of a shock to a data series over time so I can speak to that
while Jeff’s thinking a little bit so I would say from our perspective we’re as
I said in the early stages and we’ve only used a few companies for
replacement of data so no I would say there haven’t been any time series issues
or any of that I would say we’ve seen more of a fulfillment of more detailed
data that we’re actually getting that we weren’t getting before when we talk a
little bit about like building permits and things of that nature and again
protecting that of which I’ll speak to the indicators we’re really holding back
that even if we went with one vendor the most they can ever give us is like
thirty percent of an industry or as I was saying with the building permits we have
different vendors providing us that data we don’t see in our
interactions currently us losing that control it’s more of minimizing the cost
and a burden to our respondents that’s really you know what we’re attempting to
do if they’re already providing it to these entities but we have not seen the
time series issues yet so I haven’t been at BEA long enough to see
a break just yet I’ve only been with the agency for about 18 months but most of
my work has been focused on the machine learning side of the house so developing
models that would predict and these methods are constantly being retrained
and recalibrated to adapt to the information environment and we have
controls in place to detect when there’s a break and so I don’t have a story to
tell you about the years behind the breaks but what I do have to say is
that as we are getting into a more information rich world we should be
looking at as many different series as possible and not necessarily just to
improve the accuracy of our estimates but to actually robustify
them if we composite multiple series multiple credit card transaction
series or multiple employment series then we are getting more information
so that we can in theory produce a more stable signal
and we don’t put too much bias and too much emphasis too much leverage on one
specific series like there’s some forecasts that we put out
there are short-term projections for the advance
estimate that are leveraging only a couple of indicators and so if we’re
using many more we’re really diversifying that signal and making sure
that we’re producing a clean estimate mm-hmm so it seems like the two use
cases you’ve you’ve emphasized are filling in gaps from non-response from a
traditional design measurement process and then you’re using the advantage in
timeliness of another set of data to forecast out the other uses that we all
aspire to were were those that were disaggregating estimates using finer
grained data and do you see that happening in your agencies in any way so
this would be say regional or state level GDPs using a blended estimate at
that level I’m going to plane with that sort of so most of our estimates are
actually blended already at regional the regional county level GDP has blends of
employment data there’s data that’s used for marketing research
that are being used and that there’s a lot of trade journals and trade data
sets that are being brought in to blend at different and depending on the type
of location they’ll bring in different information so there is a lot of
blending that’s being done are you thinking about more in terms of like
micro data level I’m thinking bringing in the private sector data to in your
terms robustify the small area estimates in some way and it sounds
like you’re doing that your culture has already taken that step
of using multiple data at lower levels of aggregation what’s happening at
census on that score so I’ll use a couple of examples we’re currently
working on so in the retail sector more specifically I would say we’ve had two
things that have come out recently from the National Retail Federation and the
International Council of shopping centers so when we started to engage
with these third parties one of the core things we were looking at was being able
to produce estimates at potentially a monthly segment geographic
levels well once we got to working with the
groups they were like actually what would be more beneficial to us with the
changes in brick-and-mortar and e-commerce it’s more detail at the
e-commerce level and more at the product level so we very quickly in about four
to six months put an agile group together to come up with the beginnings
of a new e-commerce product that’s going out as experimental in what I mentioned
earlier where we’re replacing some of the data from non respondents we are
using that and in the second case we’re looking to actually publish
approximately 1,800 groups of what we have in NAICS and NAPCS to what our
vendor has at product level detail so funny enough we’ve been having like
product day at the Census Bureau trivet spoon rests you name it we’ve talked
about it and so what we’re currently doing is really working on using that
data and refining that data to be able to provide our data user community on a
much faster basis detail that they would typically only see in like the
economic census so we’re really letting our data user community drive the needs
and really what we’re going after with these third-party vendors so I would say
that’s two examples and then with our building permits and construction we’re
actually looking to potentially put up more characteristics at the lower
geography levels that we can do with disclosure review so I’m interested in
one other thing that there’s another difference between this panel and the
first panel in that you put out official statistics people make real
economic decisions on the indicators you put out and so the
care with which you move from the old method of doing something to a new
official method has to be a careful transition so can one of you give us some
insights into just how the culture inside the agencies is making those steps so
for the brand we have multiple prediction efforts going on and one
way we kind of show off the new tech is to pilot it in a parallel horse race
with current methods and so we build the models and kind of treat them
like so this is gonna be a Star Wars reference but Luke Skywalker flying
through the trenches of the Death Star with the little eyepiece he doesn’t
actually use the eyepiece so basically these ML models are that
eyepiece for people who want to use them so essentially these things tell you the
prediction models tell you how likely are you to have a number in certain
range and if that’s gonna be a large improvement and why it’s gonna be a
large improvement was a history history of improvement how likely to make that
shot over and over again so essentially it’s gotten a lot of great feedback and
use it’s not an official part of the estimates but it helps inform and guide
our analysts and their their research decisions and that in turn gives us more
capital to go off and get and explore explore new data sets and bring that
into our estimates we are looking at ways to make these official parts of
estimate but I think the prevailing school of thought is that we should
build more of these so we have competing models and see what are competing
consensus consensus models I would say for us as I keep saying
we’re in those early phases and there is a lot of oversight to everything we’re
attempting to do it’s very much supported inside the Census Bureau they
see this as our future and this is where we need to evolve to with data science
and using alternative data sources and again we’ve been using
administrative data sources for a number of years now we’re looking to
third-party more of a commercialized front for these data sources and I would
say it’s been really going backwards in time and using the data that we do have
to our annuals to our monthlies to our censuses and ensuring that they do
follow the patterns having a lot of methodology oversight and review to
everything we’ve done so far and it’s been well it’s been a slow process but
we never want to engage in something very very quickly that could change that
gold standard of the statistics we release and I would say the other piece
is really helping us through the methodology concepts how things are
changing over time and some of the things we noticed was in our methodology
you know when you get those lower-level details so you’re getting data from
non-respondents and specific industries what you’re actually starting to see is
maybe what the trends should have been closer to than what they were showing
based on imputation so there’s a lot of things we’re learning throughout
this process well we’re very much supported inside the Census Bureau and I
would say very much from the community of people and businesses that we’ve been
working with I would just say we’re moving with caution just because we
don’t you know we don’t want to make those moves too quickly mm-hmm so that’s
an example of the on the imputation where you’re alerted to a process that
we’ve been using for years and trusting and looking at it with new eyes yes yeah
and how we may alter those methodologies or update them in the future so I think
we see it as a lot of opportunity and instead of rushing for one it’s really
looking at it globally to say what other processes can we improve so you guys are
pioneers in a real way building a new paradigm as was mentioned earlier and at
any time in human endeavors like this you learn things at a very very rapid
rate so what are some of the surprise learnings or things that didn’t work
that you were convinced should have worked and you know
what are the what are the things you now know that you wish you knew when you
first started those kind of stories what comes to mind
I suppose one thing that we thought would be feasible to do is to predict
every single indicator under the sun with some degree of accuracy because
we were looking at some other research that some folks did that suggested it was
pretty feasible but what we found out was in order to produce a reliable
prediction that would be useful to everyone we have to come up with a
measure of predictability so it’s some function of our overall error rate
as well as if it’s a gift that’ll keep on giving meaning it’ll slash revisions
in every single quarter and so when we apply these standards to our
predictions we find that only maybe one third of the indicators that we’re
interested in are actually predictable but a good thing is that in
the case of health health is 87% predictable and we were able to
slash about one-third of the error off of health which is pretty nice the
other key thing was that coming from more of a pure stats background and
working in an agency full of economists we tend to talk past each other so
we eventually built a bridge like we have a common set of
vocabulary now that allows us to communicate ideas that are not egregious
in the other field and good I would say I would say our funniest story at census
I think was how many of the entities that we’ve talked to once they’re inside
the building we actually had one yesterday are actually using census data
you know it’s pretty funny that their data sets include our data and that what
they really do want to do is come into the building to get more interest in the
data that we have in understanding how we’re doing it first is what we’re
really looking for is what data do you have I think the
other big thing we’ve learned is you know we’ve jumped into a couple of
pilots and we did have what I would call a couple of failures like we’ve really
started to refine what we’re looking for and our business needs and those kinds
of things and I would say the other piece that we’ve probably found in that
is just building the partnership I would say that has been the biggest thing to
overcome it’s just building that relationship with the vendors like I
think that’s been our biggest win in all of this transparency this might be a
good time to switch to the there’s a lot of food for thought and what you’ve said
so why don’t we open it up to the audience
other questions or comments yeah thank you
that had to do with the forecasting that you’re doing and your experience with it in your experience did you find that
there were differences by domain you’re in your ability to forecast and by
repeating the process could you develop a sense of improvement by whatever
measure you would use over time are you getting better at this yes
so we’ve purposely developed our ML methods so that they can be easily
repurposed to any domain but what we do find and the
following is going to be very intuitive is that
there tends to be poor accuracy in areas where the sampling error is
large so that’s intuitive which is a good thing it’s a good thing
the theory works and so now we know that when we’re producing
predictions we should focus on areas where
there’s small sampling error and it reduces the total costs that we have to put
into cloud infrastructure to produce these models we have a
Bayesian spatial econometrician on staff who has been arguing that we
should start to not only use the ML process but then wrap
it within a Bayesian framework which would then help us get the uncertainties
of the models the tough thing is that some of our models run 800,000 of
iterations already so if we put into the Bayesian context with a Gibbs
sampler we need we probably need a bigger set of infrastructure to do that
but I’m sure that the the uncertainties I would get from that would be would be
quite telling for for the economic estimation process and would probably
tell us if we can reproduce the same answers over and over again over time we
do see that there is a improvement in accuracy but it might just be a matter of
statistical pattern the unit at which you are getting the
data from different vendors they are pretty much different right there are
some of them are giving you at the county level some of them state some of
them aggregated at national level a some are National some are state the time
frequencies will vary from daily to quarterly so it’s a it’s a wide range so
are you are you synthesizing all of them into a one particular
time and one one specific time is a national level quarterly okay
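The answer above, that vendor feeds arriving at daily to quarterly frequencies are brought to a single national quarterly level, can be pictured with a small sketch. This is only an illustration under assumptions: the function names, the toy numbers, and the choice to sum flow-type series and average level-type series are hypothetical and are not described by the speakers.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def quarter_of(day):
    """Map a calendar date to a (year, quarter) key."""
    return (day.year, (day.month - 1) // 3 + 1)


def to_quarterly(observations, how="sum"):
    """Collapse (date, value) pairs to one value per quarter.

    how="sum" suits flows such as spending; how="mean" suits levels
    such as employment. Both choices are assumptions for illustration.
    """
    buckets = defaultdict(list)
    for day, value in observations:
        buckets[quarter_of(day)].append(value)
    combine = sum if how == "sum" else mean
    return {q: combine(vals) for q, vals in sorted(buckets.items())}


# Hypothetical toy feeds at different native frequencies.
daily_spend = [
    (date(2018, 1, 15), 10.0),
    (date(2018, 2, 15), 12.0),
    (date(2018, 4, 2), 9.0),
]
monthly_jobs = [
    (date(2018, 1, 1), 100.0),
    (date(2018, 2, 1), 102.0),
    (date(2018, 3, 1), 104.0),
    (date(2018, 4, 1), 103.0),
]

spend_q = to_quarterly(daily_spend, how="sum")
jobs_q = to_quarterly(monthly_jobs, how="mean")
```

Collapsing everything to the quarter is what discards the within-quarter variation the questioner goes on to ask about.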
so there’s not there’s a potential possibility that you’re losing some of
those variations by not modeling it at the different levels that they’re giving
you the data yeah so if we think about it in terms of like the jagged edge
and ragged edge problems for nowcasting we definitely could do
that but at the moment we’re we’re sticking at the quarterly level
and we’re trying to move away from doing single series models to kind of be a
general economic algorithm for the for the entire economy but so when you say
single series right so every single time series very like serves a sector of SMA
has a separate set of models but now we’re trying to get to a point where we
can have one generalizable internally consistent model that will rule them all
and then you’re also using the error rate that is how much
revision you’re doing from the first to the second
the second to the third in your forecasting that we don’t incorporate
that explicitly but it’s used as a way to rate the model so we use rather than
using like cross-validation check we use a cross validation technique called M
blocks we we we basically produce a prediction for every single time period
and we use that we put that through our GDP estimation process and get a
revision estimate and then that becomes part of our error rating thank you
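The idea of producing a prediction for every single time period and scoring the model on the resulting errors can be sketched as a generic rolling-origin evaluation. This is a swapped-in textbook technique, not the "M blocks" method named by the speaker, and it does not model the step of passing predictions through a GDP estimation process to get a revision estimate. The series values, function names, and the two toy forecasters are all hypothetical.

```python
from statistics import mean


def rolling_origin_errors(series, forecast, min_train=4):
    """One-step-ahead errors: use series[:t] to predict series[t]."""
    errors = []
    for t in range(min_train, len(series)):
        history = series[:t]           # only data available before period t
        prediction = forecast(history)
        errors.append(series[t] - prediction)
    return errors


def last_value_forecast(history):
    """Naive forecaster: repeat the most recent observation."""
    return history[-1]


def mean_forecast(history):
    """Naive forecaster: predict the historical average."""
    return mean(history)


def mean_absolute_error(errors):
    return mean(abs(e) for e in errors)


# Hypothetical quarterly indicator values.
quarterly = [100.0, 101.0, 103.0, 102.0, 104.0, 106.0, 105.0, 108.0]

naive_errors = rolling_origin_errors(quarterly, last_value_forecast)
average_errors = rolling_origin_errors(quarterly, mean_forecast)

# A lower mean absolute error gives a better rating in this toy comparison.
naive_score = mean_absolute_error(naive_errors)
average_score = mean_absolute_error(average_errors)
```

Because each prediction only sees earlier periods, the scores mimic how the model would have performed had it been run in real time, which is the property that matters when the errors feed a model rating.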
there’s a question by the column I have a practical question to the extent that
you need different skill sets to even experiment with some of these methods
and data sets how are you handling that in the federal government in terms of are
you retraining employees are you trying to get different people inside
the government and what’s that like given the constraints under which you
work when even for those of us outside you can’t pay tech salaries it’s
difficult to find people so that’s a great question I would say right now the
Census Bureau is leading a pilot for data science
so there’s a big calling in the Federal Statistical agencies for data scientists
we’ve actually put a pilot on the ground in the Census Bureau to really look at
what courses what skill sets how to bring in and attract data scientists so I see
it as a kind of a two-edged sword we have to retrain the staff we have to a
degree and bring them forward to be part of this evolution that we’re doing but
at the same time when you’re trying to attract these data science students
coming out of college we also have to have new projects and things so the
projects I’ve talked a little bit about where we’re looking at this data we’re
bringing these data scientists in where we can
I mean we’re desperately trying to attract them and not only having them
learn our survey cycle process but really turning them loose in some of
these new types of data alternate data sources methodology things of that
nature that’s been successful in not only bringing them in but we have had
the struggles with meeting the salaries I mean we’ve lost a couple that we were
really hoping to keep so we are working as a statistical agency with OPM to not
only develop this series as a data scientist but also see what we can do as
far as what those pay scales would look like so that’s all being done at the
Census Bureau with BEA and others to work with OPM to do that so on the BEA side
we take two paths to working on the data science
effort first is we have a core group of really good computational researchers
that do the really really sophisticated modeling work and then I kind of hammer
in the engineering component to it so that it becomes a productionizable
piece of art but then the other side is like raising the human capital level and
so for that I lead two internal I call them panels so basically we
take folks who are just interested and technically capable and we push them all
at the same time through the process of developing a prototype and so we have
one set of prototypes for forecasting for international regional and Industry
estimates and so all those folks are in the process of
constructing a data set that would be useful for a machine learning based
prediction and then we have one set of teams right now I think are that are
focused on outlier detection outlier detection as we all know might be a
little bit subjective and so we’re trying to get to consensus on this but
through a combination of building having a each team build but then share a
common code that can be applied across the whole bureau so we’re trying to get
this hands-on experience while building the the new flashier product at the same
time let me let me probe a little on this because there are lot of people
here who are involved in higher education are trying to produce products
who are people who may serve your needs so what mismatches are you seeing between
data science programs and your needs as an agency what do you have to
train the data scientist to do once you get a hold of one well I would start
from the government side which is we need to train ourselves and
train the entire government apparatus to think about data science problems
in a practical fashion there was a long
time where data science was a hot topic in government but a lot of folks
didn’t actually know what it entailed and so finding those problems that
can lend themselves to a data science solution is actually probably the first
thing on the government side in terms of the curricula
I’d say the technical foundations are immensely
critical especially for prediction problems and parameter estimation
problems but being able to ask a question appropriately in a data
context is surprisingly rare I don’t encounter that many
people who ask the right question they’ll build something and then someone
will ask well why’d you do that and they can’t answer it so having the
ability to ask that right question is probably what I would want to see more
of from the census side I would say from the data science perspective we’ve
looked a lot at what some of the big challenges are that we’re facing you
know I’ll go back to budgets response rates those kinds of things and what
we’re really looking for is to challenge the existing staff we have to really
help us evolve and rotate into how are we gonna get to this future state and in
Econ specifically this has been a big push from Nick and from Ron both to get
us to think more in this light so what we’ve done is really thought about what
I would say just talking about what are some of the big challenges we have and
projects that align themselves very well to the data science environment so I
would say we have about a dozen of those projects and what we’ve done is created
a curriculum that’s really starting to look at the the problem solving but also
bringing in a little bit of the computer science so it’s really evolving what
this person’s gonna look like and actually putting those things together
with the core curriculum and as they’re finishing their assignments in the
curriculum and being part of these courses we’ve had a lot of like our
civic digital fellows been extreme computer scientists and
presidential fellows helping us guide this process and so their big tie was
helping us identify the projects the curriculum and then how to make sure
they’re successful but then how do you continue to evolve the organization to
support such efforts and so the other part that I think is a little difficult
is as we attract these really sophisticated people coming out of
school with the skill sets is keeping them engaged and really embedding them
with the groups that we have now to really move this forward at a lower
level not just coming from the management spectrum down we’re really
getting it going and we’ve seen some success with that like just some of the
projects we held a big data day where we brought all of our civic digital
fellows and other organizations to census and we showed off some of the
really cool projects that had been done just over the summer it really got
people excited about what the future would look like and it was amazing the
outpouring we had of people who really want to be a part of this it’s a great
way to end so we don’t know how to thank you for joining us at this little
University on the hilltop thank you very much I do okay great I’m actually on my
tiptoes so you might have noticed that I’m about two inches taller than I was
earlier today when I had and it doesn’t seem to be helping actually okay
so I think that was wonderful thank you I’d like to thank the panelists and I’d
also like to thank those who participated in the conversation they
were really two very different takes on data blending and I think it’s really
important to understand all the different ways people are blending pros
and cons of these different approaches so maybe we can give one more round of
applause to all of those who helped participate and of course these types of
events do not occur with only one person doing the work and so I’d really like to
take a moment to thank those who really helped with the planning I’d like to
thank Michelyne Chavez for actually making sure this all came together today
I think she was the most instrumental in making sure that the event came came
together I’d like to thank Mike tergat in Amy O’Hara for their support and
helping kind of design the event and of course I’d like to thank MDI and the
McCourt School of Public Policy for funding the event to make sure that we
could have a nice reception following the event today so I hope all of you
will stick around chat with some of the panelists and with the other
participants and tomorrow there is a small group of us that are going to be
working on a white paper related to data blending that will eventually become a
product that we post at MDI if you have thoughts that you’d like to share with
the panelists and others who will be participating in that please come and
chat with us because we do want to put together a set of best practices that
might be useful to a broader community thanks a lot for coming and enjoy the
