
Linking Administrative and Survey Data: Session 2


[Ambyr] Okay, let’s get started. Good afternoon
and welcome to Session 2 of the Linking Administrative Data with
Survey Data virtual data training. I’m your host, Ambyr Amen-Ra. I’m the project data manager here at ICPSR on
the campus of the University of Michigan. Please remember to mute yourselves
during the training. Instructors will announce when to take yourselves off mute.
If you have any questions or technical difficulties, please feel free to use the
BlueJeans chat feature or you can email me or Bianca. I have a few announcements.
Instructors will be posting notes to assist with the exercises at the end of
each session. We will be using groups and conferences for small group exercises.
The directions are posted in the announcements, I've put them in the BlueJeans
chat, and right before the exercise I'll also walk you through the process.
I would like to introduce Zachary Seeskin who will walk you through
understanding administrative data, data use agreements, and informed consent. [Zach] Great. Thanks so much Ambyr. I’ll just go ahead and set up the slide. Okay. The
first thing is can everybody see the slides okay and see and hear me okay? Excellent. I’m really
excited about this topic and think that this is a really neat subject for this course. I think we hear somebody’s background noise. If you could mute
your microphone please. I think that both new innovations in research and official and government statistics are looking more and more to data sources that go beyond surveys, recognizing the potential and the value of administrative and related data sources for research, and recognizing the challenges of relying on survey data alone, where increasing costs for surveys and declining response rates over time present some challenges. So today we're going to discuss administrative data quality.
So before you conduct your data linkage, once you've identified the administrative data set you want to work with, you want to understand its data quality to inform your analysis plan. How would you do that? What are the particular issues facing administrative data? And then we'll go into some aspects of managing a data linkage project, including data use agreements and informed consent. So, the plan for today: I want to start off by making sure that everyone's caught up and see if there are any questions based upon yesterday's lecture or the project proposal so far. And as I mentioned, today's main focus will be on administrative data, whereas future lectures are going to talk about the questions concerning using the linked data. Today we're going to focus particularly on the administrative data set, commercial data set, or other data set for policy research. Sometimes you have a chance to analyze and learn about the administrative data set before conducting your research, and so how would you go about doing that?
So by the end of today you’ll have a definition of data quality and what that
means in the context of policy research. You’ll understand what particular issues
face administrative data and how they are different from analyzing survey data.
And you'll understand data quality as a multi-dimensional concept: what are the different aspects of it that are important, how would you assess data quality, and what are the limitations? At the end of the lecture, we'll focus again on the legal requirements and the ethical responsibility to protect the privacy of individuals or entities represented in the data. So we'll discuss, specifically, data use agreements and informed consent for participants to have their data linked to the survey data set. And
we’ll wrap up by talking about what’s coming next in the course and the
project proposal work as it relates to this lecture. I just want to start off,
see if there are any questions about the course so far or material from the first
lecture and also about your project proposals. So you can either unmute
yourself to ask a question or enter your question in the chat box. We'll give folks a bit of time. [pause] Okay, so it looks like perhaps no questions at this time.
But I will keep an eye on the chat box in case any questions come in in the
next couple of minutes. Okay, so a recap of Rupa's first lecture. We defined the kinds of administrative data sets that are in scope for this course, using a fairly broad definition of what administrative data are. This includes the data that are traditionally described as administrative, that is, data from federal, state, or local agencies, typically administering a government program. But we're also including data that come from a private sector company that can be acquired for policy research; we sometimes refer to this as commercial data. We learned about the NSECE and its importance for policy research and for planning, and the valuable data that it provides on families' child care needs and the availability of child care providers. And then Rupa reviewed prominent
examples of data linkages to the NSECE and I think she did a good job of
representing the range of the kinds of datasets that can be linked and the value
that data linkage can provide. So one example was child care subsidy records
that provide information about low-income families that can enrich and
add to what’s in the survey data and data that can also be used to validate
the survey measures. A very different kind of example was real estate finance data from Zillow. That's an example of private sector data. You could even think about a version of this where it could be scraping data from the web; Zillow has an API, for example. In this case, the data are able to be acquired directly from Zillow. So you can study how home values are related to the use or provision of child care in different areas. And Rupa provided many other kinds
of examples representing the rich possibilities when conducting data
linkage. So, a definition of data quality. I provide a simple and quite commonly used definition of data quality: the fitness of a dataset to serve its intended use. For this course, we're focused on policy research. We're specifically interested in how to use data to draw statistical inferences to inform policy decisions. There are other uses of data, of course; administrative data are commonly used for the administration of a program or for planning. And depending upon the kind of use, data quality could have a very different meaning in terms of what you need for that use. Data are commonly used for decision making in different contexts, but decision making in a business context might be different from policy research, where for policy we're really interested in ethical concerns and people being treated fairly. There are additional considerations in maintaining the public trust. And we're also combining administrative data with survey data, and we're thinking about using the linkages to use the data sources in different ways, to draw on different strengths from the data sources. So the requirements of data quality may be different even for different kinds of linkages. We'll discuss some examples later, and an exercise, where there's a difference in what you might require from data quality if the data coming from your administrative data source are used as the dependent variables in your analysis, the objects of the analysis, versus if you're using them as covariates in a statistical model to adjust a model that may be based primarily on your survey data. Those are two very valuable uses of administrative data, but with different quality considerations in those situations. So for administrative data, we need data
quality that supports drawing statistical inferences to inform policy conclusions. We have to think about what aspects of the statistical inferences are important to guide policy. A classic textbook from Shadish, Cook, and Campbell has a really nice discussion of aspects of statistical inference. They describe four major aspects of validity of an inference, and many of these are important for administrative data. The first would be statistical conclusion validity. This concerns conclusions about the relationships among variables; it really refers more to correlations between variables rather than causation. Having appropriate statistical power and good measurement would both be aspects of statistical conclusion validity. Internal validity goes a step further, in assessing cause and effect: if you implement a policy, what would the effects and the outcomes be? The gold standard for having inferences with internal validity is to conduct a randomized controlled trial. But we can also conduct causal analysis using observational data sets; it's just a little bit more challenging to draw those kinds of inferences. Construct validity would refer
to whether a measurement measures the concept that is intended to be studied. A classic kind of example is if you want to study intelligence and you're measuring test scores: do the test scores capture the intelligence you want to measure well enough, or is that too indirect a measure? And we could think about an example that pertains to administrative data, a little bit simpler though. Let's say you are interested in measuring somebody's income but you're only able to capture information about their salary; say you have data from some employers and have people's salaries but not their entire income. That salary is very valuable to have but maybe doesn't fully capture the concept of income that would be a better measure to have. So those are different kinds of examples of the challenges of construct validity in using administrative data sources.
External validity would refer to whether, when you draw an inference from your data set, that inference, which pertains to the data you analyzed, is generalizable to the broader population that you're interested in. This is where surveys have particular strengths: we can conduct random sampling that allows the data to be representative and allows externally valid inferences for the population of interest. So I would say for surveys, we're usually most interested in statistical conclusion validity, construct validity, and particularly external validity. Often internal validity is left to randomized controlled trials, but people also analyze survey datasets to draw causal inferences, of course. So in different situations, administrative datasets could have different strengths and weaknesses for these aspects of statistical validity. Okay, so next I want to break into our
groups for our first of two exercises. So based upon what you know about
administrative data or what you think some of the issues might be, I want you
to brainstorm three examples of what aspects of an administrative data source
would make that source either higher or lower quality for conducting policy
research. And I think Ambyr can provide further instruction on breaking into the small groups. [Ambyr] Okay, so don't leave the BlueJeans session, but keep it muted. In Canvas you'll see a groups icon on the left side of the navigation bar. Click on groups; once you're there, go down and click on conferences. You'll see Session 2's conference has already started, and you can click join. Once you're inside, just use the chat feature on the right-hand side. [Zach] Excellent, thanks Ambyr. So I think we'll give around eight minutes for teams to discuss and brainstorm. And
then we’ll regroup and I’ll ask for a representative from each team to report
back and share ideas. I’ll aim to give about a five-minute warning
and a two-minute warning. Thanks. [pause for group work] About five more minutes for each team. It
looks like the teams were all able to connect smoothly but if anybody had an
issue, let us know in the chat box. Thanks. [pause for group work] And last couple minutes for each team. [pause for group work] All right, so let’s regroup. I’ll stop
sharing the slide for a second for discussion. So I think there are two
groups, so if I could have a volunteer from each group give a couple of examples of what aspects of data quality you would want to assess for an administrative data source. [Madeleine] This is Madeleine, I'll start for Group 2. We talked about the timeliness of the data: can you get it from the administrative source in a timeline that makes sense for a policy-relevant issue? Data quality and cleanliness; the availability of a good linking variable in the data; then scale or unit issues, so is it being collected at the right geography or household level or whatever it is that you're trying to deal with? And then, sort of, interpretation of what they have in their data: the metadata and documentation and how good that is, and whether they've explained what exactly the data that you're looking at means. [Zach] That's great, that's a very thorough list. Good
examples of where we're going. As we will discuss, the challenge of administrative data is that you get what you get: you don't have a statistical officer or agency who's making sure that the data have all the properties and the cleanliness, and who is working in a certain time frame to make sure that the data are available … that the data have all the aspects of quality that you would necessarily want to have. These are all things that often can be overcome, but things you want to check. Those are all great examples. Group number 1, a couple of examples that came up when you were brainstorming, either similar to or different from Group 2? [pause] A volunteer from Group number 1? [Tingting] Okay, so I think our group
talked about primarily the same things Group 2 just covered, in terms of inconsistencies among the constructs and data cleanliness. And also, sometimes I feel like, in my own experience, the data collected by DOE are just very basic and only cover certain demographics. Because they're not specifically designed for statistical analysis, a lot of the variables researchers have in mind are probably missing. [Zach] Yeah, I think
those are all great examples as well. One thing that we'll discuss, that I've often seen with an administrative data set, is that there are certain variables that are really critical to the administration that are higher quality or well populated, with limited missing data, but then other variables that you're interested in for policy research may have more missing data or may have quality concerns just because they're not as important for administration. Okay great, thank you both. We'll go back to the PowerPoint slide. All right, hopefully everybody can see that again. Okay, so we're going to jump into a lot of the themes
that teams brainstormed and go into a bit more detail as well. So
the key thing about administrative data, as we discussed, is that the researcher or the statistical agency doesn't control the data collection. In some instances, which I think would be pretty rare, there might be some feedback that can occur between a statistical agency and the agency administering the program, but I think that's the exception to the rule. Bob Groves, who as some of you may know was the Director of the Census Bureau a while back, has drawn a neat and helpful distinction in some recent writing between found data and designed data. We can think about surveys as designed to have the properties that are needed to support inference; still, there are many challenges in ensuring that the data have those properties. Administrative data, by contrast, would be found data: they just come as they are, and they're valuable, but just because of their nature as found data there are some limitations. The data were collected for the purpose of the
administration of the program. We discussed that certain variables are
important for the administrative need; others may not be. Another particular thing about administrative data is that they can be affected by legislation or could need to respond to legislative requirements. An example would be tax data: IRS data are very commonly used in policy research and have been found to have some good properties relative to survey records. It's pretty challenging for somebody to look up, for a survey, details about their financial background that can be taken directly from tax records. But at the beginning of this calendar year we had a new tax law passed, which will mean a change in the Form 1040 that's used to collect a lot of those data. So that means there are challenges in understanding that transition and what's coming in with the data, and work needs to be done to adjust to those changes in the form; without really careful consideration you could have a break in series that could make 2016 and 2017 IRS tax data not so comparable with 2018 or 2019. Commercial data are collected for the purpose of a business need. They're often packaged and sold to other companies, where a statistical agency or researchers are not the main target for the product, even though the researchers recognize the value of the data. So there are certainly risks there of the data not being collected for statistical uses. Two aspects that are related and that are challenges with administrative data would be the documentation and the data cleanliness,
which we’ve talked about a bit. Surveys have very strong documentation to
support data users. You can get details on all the aspects and steps from the data collection to the final data set. For an administrative data set, your documentation will tend to be much more limited, and often the process of how the final data set was created and curated can be a little bit mysterious. Administrative data and commercial data may have a smaller set of initial users who are managing the program, and there's some internal knowledge about how to use and understand those data sets, which would mean that less work goes into the documentation. I've even seen examples where there can be detective work needed to find out what variables represent: you might have a limited codebook, but that codebook really doesn't have the full detail to understand what exactly those variables mean and what they're aiming to measure. For similar reasons, the data
may not be very clean. I’ve seen examples where there are variables that
get swapped in the data, so for a certain time period you have two variables
switch with each other, and you would need to do some clever investigation to discover that when you see the puzzling change in trend; that kind of thing can happen. It certainly can be hard to compare data that are aggregated from multiple agencies: state agencies, local agencies. So how are you going to structure the data to make sure that the data formats are similar when you bring data from multiple agencies together? And again, if there are only a few initial users at the agencies who are relying on certain variables being higher quality, but not paying attention to the whole data set's ability to support policy research, you can't always count on the data to be as well organized as survey data. This table presents a pretty thorough list of
data quality elements that could either apply to survey or administrative data.
They’ve been informed by literature reviews and experiences on different
projects. Some of this comes from the literature from statistical agencies, particularly in Europe, where statistical agencies are using administrative data pretty broadly, sometimes in place of censuses and surveys. They've done a lot of good work in documenting what the different data quality aspects are and how you would go about assessing these in administrative data. So we've grouped them into related concepts, in this table, five different groups. Accuracy has many different aspects and certainly requires careful assessment. It would refer both to [inaudible] the data input, can you trust that you have the correct values in the data, and to the data output, supporting the production of estimates and statistical inferences that are approximately correct. So the different aspects would be
measurement error, whether the values that are in the data source are the true values, whether they are correct. In different situations you can imagine that being either a strength of administrative data or of surveys, so we'll revisit that point. There are a lot of processing steps to arrive at the final data set, and any of those could involve decisions that introduce error or mistakes in the processing. So processing error is an element that could potentially affect both surveys and administrative records. Construct validity would reflect whether the topic measured is the one that is intended to be measured for the research. For surveys, we might only get indirect measures to reduce respondent burden. Revisiting the example of measures of financial background, you might ask a simpler question in order to reduce the respondent burden and get an approximation of the concept you'd like to obtain, so that could be a challenge there. Whereas in survey data there's a lot of thought in how you phrase a question and what you collect, in administrative data it's more that you get what you get. And then an aspect we'll also revisit later is external validity. Are your data representative of the population? You want to avoid certain kinds of cases being missed by your data, or having cases that are out of scope included, to avoid over- or under-representation. Relevance is an important component.
You have a policy topic that you're interested in, and you need to make sure that your data support studying that specific topic. There are factors related to timeliness and other time-related factors; those came up in our discussion. You want to have the data available in time to inform policy decisions. But there are a couple of other ways that time can play in. Surveys typically rely on a reference date that allows for comparability: for the Census, for example, you respond based upon your situation as of April 1st of the Census year, or for a survey you might answer about your employment situation as of a certain
date. Administrative data may be collected from different time periods
and that can make it difficult to reconcile and analyze your data.
We’ve grouped accessibility, clarity, and transparency together.
Accessibility would refer to the conditions in which the data can be
obtained and analyzed: where, how, and the general availability of the data to users. Clarity would refer to whether the data, or the statistical information or resulting estimates, are available in a clear and understandable format; documentation is a critical part of this. And transparency refers to whether the methodologies for producing the data and arriving at the final data set or final estimates are clear, so that you understand them and know how to interpret your data or your ultimate estimate. Again, you usually get a lot of detail for surveys, but sometimes you're
limited when you’re using administrative or commercial data. A final group is
coherence and comparability. Coherence would refer to the ability to combine
your data or statistics with other statistical information that’s available.
That can refer to record linkage and linking keys, and it can also refer to combinations of data sources and estimates beyond record linkage. So do you have the appropriate information and context for the estimates to bring together data or estimates from different kinds of sources? And comparability would refer to whether your data collection practices and measures use the same processes across the groups you want to compare, across geographies, or over time, allowing you to make good conclusions about comparisons between different groups. Some recent frameworks have found it really helpful to think
about sources of error for administrative data in terms of the rows
or the entities, the households and persons reflected in the data set, and the variables, the columns. So we're thinking about the data set in terms of rows and columns. Now, not all data are structured in such a way, but often thinking about the data set in this flat format of rows and columns can help us think about the issues. Some context: there's a lot of literature, I'm sure many of you are familiar with, on total survey error frameworks that break down sources of error in surveys into different components. One would be sampling error, which would be related to having a subset of the population and would typically be quantified by a standard error on your estimate or a confidence interval. And then there are non-sampling errors, which would come from missing data, coverage errors, and measurement errors that contribute to error in the ultimate estimate. More recent work has adapted this to think about what the issues are for administrative data, which are in a somewhat different context. Some of this is based upon a discussion from a new textbook, Big Data and Social Science. Statistician Paul Biemer has a nice framework he lays out for administrative data and other sources he calls big data sources.
So I’ll start with rows, and you have a population of records where the
rows represent families, households, or individuals, or, for the NSECE, child care providers. There are some records that are omitted that should be in the data, so undercoverage or omission. There are some records that might be included twice; those would be duplication. And some records are included wrongly; those would be erroneous inclusions leading to overcoverage.
Administrative data are different to think about than survey data. In survey
data we sample and then we think about design and weighting to have generalizable
inferences. In administrative data, typically you are thinking about administering a
program, for example, and you think about having all the records for cases
receiving a benefit or benefiting from a program in your data. For columns, the variables, we have validity error, which is the terminology used for construct validity: do you measure what you intend to measure? Then there's measurement error, or inaccurate values; then there may be mistakes in data processing; and finally missing data. Now you can think about these with some examples. An error could affect a whole variable, so for a validity error, the way the concept is measured is incorrect for the whole variable, or an error could affect just specific values within a variable. I want to dive a little bit deeper into
the critical dimension of accuracy and discuss some of the specific issues for
administrative data. Again, we might think about administrative data as strong
in terms of coverage. For the administration of a program, think about the examples of SNAP (food stamps) or TANF (Temporary Assistance for Needy Families); there you should have a record for all cases receiving benefits. So in a sense you wouldn't expect to have missing cases in your data, but there are instances where you might be interested in a certain group for policy and that group may be systematically missing from the data set; the group may just not be needed for administration of the program even though you're interested in it for policy. So you need either a strategy to handle that or recognition of that limitation of the
data source. Now when you have missing records for households and families in administrative data, there's the issue of what exactly that means. The fact of missingness may be important, for example, indicating 'not receiving benefits,' or it could be a different issue, a data error. And sometimes you see a record missing that you expect to be there from linkage to the survey, and it can be hard to figure out which is the case: is it a data error or does it reflect a change in receiving benefits? Examining TANF records, again Temporary Assistance for Needy Families, it can be hard to separate cases if you see a case have an interruption in receiving benefits. It can be hard to tell whether that case was no longer eligible for benefits for certain periods or whether there was a program or administration issue and that case just missed benefits for a period of time, and that will have implications for the ultimate policy analysis you want to do. Designed data take great care to manage coverage, but they can be subject
to coverage errors; typically, though, you're set up statistically to think about methods to address those coverage issues. So that's a contrast with administrative data where, again, you get what you get. So when these kinds of issues arise, you should be concerned about systematic bias in the inferences you draw from such a data set and about the external validity of
the data set. Measurement error then could be either a strength or a weakness of
an administrative data source. Surveys are subject to recall bias, cognitive issues, and respondent burden: people have limited time, and that can lead to completing a survey in a rushed way and lead to errors. So that may be an advantage for administrative data in some cases. Administrative data may also come closer to a ground-truth measure: they come from one single data source and don't depend upon several respondents to look up the information or determine it on their own. But again, the issue for administrative data is that they are not collected for statistical purposes. The information may
not be clean, so you need to do checks to see if you can trust the numbers. And
again, the number that’s needed for administration may be different from the
number that you’re actually interested in for policy research. So now we have
a good understanding of what different elements and aspects of data quality are
that you want to assess. So what are some ways that you can go about assessing
them? The European statistical agencies, I think, are certainly leaders here and have suggested different measures to help assess data quality. Generally, these are often indirect, and many of these measures have some weaknesses. Your very best situation is to have some kind of gold standard data set that you trust and can compare to for assessment. But often that gold standard data set doesn't exist or is hard to obtain. When you have that gold standard, you can either see if you can link at the record level to verify, or you can compare resulting estimates to estimates that you trust. But often there is some amount of guesswork in assessing different dimensions of data quality. So here's a table that presents different
measures that reflect the accuracy dimension alone. One would be
item non-response. So for a specific variable, what’s the extent of missing
data that could lead to biases in any inferences for which you use that variable? Is that missingness related to key variables that you're interested in? Undercoverage: can you measure the extent of units that should be included in the data set but aren't? One approach is that if you link to a survey and some records that you think should be in the dataset aren't there, that could provide you a measure of undercoverage. And overcoverage would be whether you can detect duplicates or records that are out of scope and quantify that, or, if you are able to, successfully identify them and remove them. But you're not always able to do that. We discussed time factors that are related to accuracy. You may have some indication that the reference period for some cases differs from other records; often you do get information on the time period from which data come. And finally, you would expect variables to fall within certain ranges, particularly per a codebook, or you expect certain logical relationships with other variables. So checking the rates at which these range and logical checks fail would be a good indicator of whether you can trust the data, and an extension of this would be to use methods for outlier detection. Even if a value is possible, it may seem unexpected, and you might check for the prevalence of those kinds of outliers to make sure that you can trust the administrative data source.
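As a rough illustration of how a few of these accuracy indicators can be computed, here is a minimal Python sketch; the column names, ranges, and tiny example table are hypothetical, not taken from any data set discussed in the course.

```python
# Sketch of simple accuracy indicators for an administrative extract.
# All names, ranges, and values here are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "case_id": [101, 102, 102, 103, 104],
    "age":     [34, None, 29, 131, 47],          # one missing, one implausible
    "benefit": [250.0, 300.0, 300.0, -10.0, 0.0],
})

# Item nonresponse rate for each variable
missing_rates = df.isna().mean()

# Possible duplication (overcoverage): repeated case identifiers
duplicate_rate = df.duplicated(subset="case_id").mean()

# Range and logical edit checks, e.g. plausible ages and non-negative benefits
violations = {
    "age_out_of_range": int((~df["age"].dropna().between(0, 115)).sum()),
    "negative_benefit": int((df["benefit"] < 0).sum()),
}

print(missing_rates, duplicate_rate, violations, sep="\n")
```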
This is a table with a more comprehensive list of indicators for the different dimensions. Typically these are pretty simple ideas that can be assessed without doing a substantial analysis, so many of these are very helpful and quite possible to do. It shows many aspects of data relevance that you can quantify, or rather assess: you want to look at the definitions of units and variables to make sure that those are the ones that are most relevant for the analysis you want to do. What was the reference time period? Do you have the domains and the groups of interest for policy research? And what are the thoroughness, scope, and richness of the data for doing the analysis that you want to do? Accuracy has been pretty well covered, a lot of similar themes to what we've discussed before, so I'll skip over that. Timeliness measures could encompass how frequently the data are updated, so that if you want to track changes over time for policy, the data are updated frequently enough. We discussed the time for the data delivery: administrative data sometimes take time for an agency to process and deliver, so you want to consider whether you can get the data source in time to inform policy decisions. A different aspect of that is the uncertainty of when the data arrive: there can be variability in when the data are available, and it's good to have an indication that the data will
arrive in time. You can assess comparability over time. So in some work
we've done here at NORC looking at data quality, we found that very simple time series graphs can reveal unexpected data patterns, suggesting data inconsistencies. If you see a jump in the time series for a certain variable, that suggests an issue that you want to address. You can take a parallel idea and apply it to cross-geography or cross-group comparisons. You expect some differences, of course, but if you see something that's unexpectedly large, that could suggest an issue.
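As a rough illustration of that kind of check, here is a minimal Python sketch of a monthly time series and a cross-group comparison; the file name and column names are hypothetical.

```python
# Sketch: plot a monthly mean to spot breaks in series, then compare groups.
# The file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

records = pd.read_csv("admin_extract.csv", parse_dates=["benefit_month"])

# Mean benefit amount by month; a sudden jump or drop can flag an inconsistency
monthly_mean = (records
                .groupby(records["benefit_month"].dt.to_period("M"))["benefit_amount"]
                .mean())
monthly_mean.plot(marker="o", title="Mean benefit amount by month")
plt.ylabel("Mean benefit amount")
plt.show()

# The parallel idea across groups: an unexpectedly large gap between states
# (or between urban and rural cases) can flag a comparability issue
print(records.groupby("state")["benefit_amount"].mean().sort_values())
```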
The slide does not provide specific indicators for availability and clarity, but certainly we can reach out to the data provider for the documentation that's available and see if that documentation is rich enough to provide what you need, how the data are presented, and whether you have a good way to access the data. There's not as much literature on the use of data visualization for assessing the quality of administrative data, but here at NORC we found that data visualization can be incredibly helpful. Multivariate data visualization may be a little bit complex at first, but it can allow for gathering many different kinds of variables and assessing patterns in aggregate to suggest where there may or may not be concerns. So this is something that I would suggest considering. Some simple things can be done, such as looking at distributional plots for your variables, densities for the distribution. This shows an example of a powerful plot we
found, called the table plot. The idea here is to pick a sorting variable; in this example it's age. It could be time or any other key variable in your data set. The other columns reflect other variables, and they are sorted by that sorting variable. The y-axis, if you will, is percentiles of the age distribution. So this example shows different categorical variables, and by these different groupings of age it gives you the distribution pattern. In this example you can see gender in the third column and marital status in the fourth column, and you can see changes in those variables over age in the dataset, which is a sensible result that you would expect. This example I provide is more exploratory, less related to data quality. What we found is that you can find strange changes in the distributions by using this kind of plot, and that can suggest an issue that requires further investigation. These are all categorical variables in this plot, but you can look at a continuous variable [inaudible] as well. So if anybody wants to try this, I think the R package is called tabplot. It can track continuous variables, looking at the spread of the distribution and changes in the mean or the center of that distribution.
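A rough Python analogue of the idea behind the table plot, sorting by one key variable, cutting it into percentile bins, and inspecting how other variables shift across bins, might look like the sketch below; the file and column names are hypothetical.

```python
# Sketch of a tableplot-style check: bin a sorting variable (age here) into
# deciles and inspect other variables across the bins. Names are hypothetical.
import pandas as pd

df = pd.read_csv("linked_file.csv")   # hypothetical linked data set

# Deciles of the sorting variable (it could be time or another key variable)
df["age_bin"] = pd.qcut(df["age"], q=10)

# Share of each marital status category within every age decile
marital_by_age = (df.groupby("age_bin", observed=True)["marital_status"]
                    .value_counts(normalize=True)
                    .unstack(fill_value=0))

# Mean and spread of a continuous variable within every age decile
income_by_age = df.groupby("age_bin", observed=True)["income"].agg(["mean", "std"])

print(marital_by_age.round(2))
print(income_by_age.round(1))
```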
Okay, now to set up our second exercise today. Here is an example from some data quality analyses from my own research. As a Census Bureau dissertation fellow, I did work to look at commercial property tax data that was available from CoreLogic. The Census Bureau is interested in investigating whether administrative data sources could be used, linked to the American Community Survey, to provide information for housing statistics, and even whether the information from other data sources was high enough quality to remove a question from the questionnaire. So I did analyses on linked CoreLogic and ACS data, largely focusing on single-family homes, to assess the coverage of the data set by linking to the 2010 ACS and finding the percent that were matched. We found, first looking at state coverage, that there were three states entirely missing from the data set and several others that were pretty low, while there were other states that were pretty good: they had 70% or 80% or higher. And we used the same data and did a similar analysis but by some key characteristics of the data set. We found some major differences in the coverage of the CoreLogic data. What particularly stood out was that rural households were underrepresented; records were just not as often available in rural areas compared to urban areas. There are differences by socioeconomic status as well, but not as stark, and we think that this was probably reflecting the other key patterns in the data.
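A minimal sketch of that kind of coverage check in Python, assuming a survey file and an administrative file that share a linking identifier; the file names, the identifier, and the grouping variables are all hypothetical.

```python
# Sketch of a coverage check: share of survey records matched to the
# administrative file, overall and by key characteristics. Names hypothetical.
import pandas as pd

survey = pd.read_csv("survey_frame.csv")   # hypothetical survey records
admin = pd.read_csv("admin_records.csv")   # hypothetical administrative records

# Flag survey records that find a match in the administrative file
linked = survey.merge(admin[["link_id"]].drop_duplicates(),
                      on="link_id", how="left", indicator=True)
linked["matched"] = linked["_merge"] == "both"

# Coverage overall and by key characteristics
print("Overall match rate:", round(linked["matched"].mean(), 3))
print(linked.groupby("state")["matched"].mean().sort_values())
print(linked.groupby("urban_rural")["matched"].mean())
```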
So the exercise, based upon looking at this kind of data quality analysis: if you were to make an analysis plan using housing data, studying housing characteristics from CoreLogic, how does this kind of result impact your analysis plan? And I want you to think about two different kinds of situations. One, where the CoreLogic housing data are the objects or dependent variables in your analysis. And second, where they are covariates or independent variables in an analysis that's focusing primarily on other variables. So again, break into the two groups. I think we'll do eight minutes for this exercise; think about how these results would inform your analysis plan under each of these scenarios. And again, I'll give you about a five-minute and about a two-minute warning. [Rupa] Zach, can I ask a clarifying question? On your [indistinct]? So the denominator is the ACS households, and you're asking how many of the ACS households are found in the CoreLogic data, or …? [Zach] Yes, that's exactly right. So this reflects partly challenging linkage and partly data that are missing from CoreLogic. All right, so go ahead and break into your groups. [pause for group work] Five more minutes for each team. Thanks. Last couple of minutes for each team. You're free to take your time with that, but in case the teams wrap up earlier or are ready with their thoughts to share, you can indicate that in the chat box. Thanks. [pause] I think we should go ahead and regroup.
I’ll stop sharing the slides for a second. So we’ll go in reverse order for
this one. So if I could have a volunteer from Group number 1 to share their thoughts
on informing their analysis plan based upon the two different scenarios. [Tingting] Okay,
I guess I was nominated to be the speaker again. So I’ll try to answer this
question. So our group, our discussion was primarily around the consequences of the missing data values if the data were considered as a dependent variable versus a covariate. I think we all came to the agreement that the issue would be more severe, more major, if they were considered as a dependent variable, because this is such a large data set. If they were included as a covariate, we can possibly use other variables, and also we can impute the data; it might be possible to enhance the estimates a little bit in that way. [Zach] I think that's great. And then
could a representative from Team number 2 talk about thoughts to add to that for the analysis in the two situations? [Danielle] Hi, this is Danielle. A couple of things that we
talked about were issues with validity around the rural areas and also issues
around some of the states not being represented in the data. And then we had
a couple of questions about like, “Is there a threshold for what you would
consider a good match? Is like 70% a good match and below that is not, or is there
any kind of threshold like that?” And then our other question was just trying to
think about how you could deal with some of that missing data. [Zach] Mm-hmm, yep I think
those are … I think those are right on point and a very good question. So when
you are … let’s [indistinct] for a second. When you are analyzing data and
figuring out a threshold that's appropriate, it's hard to say there's a hard and fast rule. But one thing I would strongly suggest, and I'll discuss it a bit more on the next slide, is that you can do sensitivity analyses. Based on the extent of missing data, let's say you allow for a range of possibilities at different extremes, and you see if the result is robust in those different situations. Certainly if you are using the data as a dependent variable, that's going to be really challenging, so I'll pull the slides back up. So what I would suggest, if
you’re using the data as a dependent variable is you would focus on where the
data set has strength. If you had the initial hope to study the entire country for that dependent variable, you may need to limit the scope and say, well, we can't say anything about Vermont where there's no data, and no statistical method is really going to allow us to make a conclusion about Vermont, but there are other states where we can do a good analysis, and even though we haven't hit a 90 percent threshold, we can use statistical methods to compensate for the non-linkage there. And then if you're just using the data as an independent variable and you don't have as strong a requirement of your data source, there I would definitely advocate using statistical methods to adjust. So if you have missing data, many people are using combined, linked administrative and survey data and using imputation to fill in where the administrative data are missing. You might also think about whether you have an option to use either administrative data or survey data in different geographic areas; you might rely on one or the other more in different areas to draw your conclusions. But play to the strengths of your data, focus on where the data quality is the strongest, and recognize the caveats that you have for your analysis. Other points that I want to make about this slide? Yeah, so I think your range of options is
really to limit your analysis to where the data quality is strongest. Recognize you have two different data sources with different strengths in terms of what they measure; the administrative data source may be really strong where you have the data available, but you can't use it in other areas. And then use statistical adjustment, imputation, and weighting to compensate for the weaknesses, recognizing though that the statistical methods rely on assumptions, sometimes strong assumptions. If we take the example of missing data, the assumption is that where you have missing data, the relationship with the variables you're interested in is the same for the missing data as for the data you observe. So that's why I think sensitivity analyses can be really important. Continuing with the missing data example, you could assume that where you have missing data, the values at one extreme would be fairly low or fairly high, taking the example of a continuous variable. Then you check the robustness of your policy conclusion, your statistical inference, under the more extreme scenarios, and if your policy conclusion holds up then you can make a really strong argument for your research. Or your result isn't robust to all the different situations, and then you have a caveat to your conclusion and future research that needs to be conducted.
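A minimal Python sketch of such a bounding sensitivity analysis, assuming a continuous outcome with missing values; the file name, variable, and choice of extremes are hypothetical.

```python
# Sketch of a missing-data sensitivity analysis: bound an estimate by filling
# missing values at low and high extremes. Names and choices are hypothetical.
import pandas as pd

df = pd.read_csv("linked_analysis_file.csv")   # hypothetical linked data set
y = df["home_value"]                           # outcome with missing values

# Chosen extremes for the bounds (an assumption you would justify and vary)
low, high = y.quantile(0.10), y.quantile(0.90)

estimates = {
    "complete cases only": y.dropna().mean(),
    "missing set to low":  y.fillna(low).mean(),
    "missing set to high": y.fillna(high).mean(),
}

for scenario, estimate in estimates.items():
    print(f"{scenario}: {estimate:,.0f}")

# If the policy conclusion holds across these scenarios it is robust to the
# missingness; if not, report that sensitivity as a caveat.
```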
The last point in discussing assessing the quality of administrative data is that before you do your research, you want to have the best sense of the data quality you can, to get at the feasibility of the analysis and whether the data quality is going to be strong enough. So what are the things you can do before conducting the research to evaluate the feasibility of your project and the strengths or weaknesses of the administrative data source? Very often other research has been done with the data, and I recommend doing a detailed literature scan; that can be a great resource for understanding the quality to expect. What kinds of analyses were people able to do in that research, what were the limitations, and what did they note about the data quality? All of that is extremely valuable. You can contact the data provider and ask if they have a codebook available, or what documentation
is available? In different situations you could get different levels of
helpfulness from the data provider but that’s strongly recommended. It’s always
good to have some level of flexibility with your research plan and be prepared for
some surprises. It’s hard to anticipate what it will be like to work with the data set
until you have it, at least in my experience. So that’s recommended as well but definitely use the resources you have
available to do your research before proceeding with your project and putting
your proposal together. So in the last part of today’s talk we’re going to talk
about some of the administrative aspects of a linkage with administrative data.
Particularly as researchers we have an ethical responsibility to the subjects
represented in our data and the families, individuals, households, providers in the
data to protect their privacy and their confidentiality. It’s also important to
be aware of the legal requirements for working with that data
and the issues that can arise. It’s possible in some situations to be
subject to large fines if you violate requirements for data privacy. And not to
scare you, but there are situations where jail time is a possibility. So understand …
understanding the requirements is important. There are typically laws that provide guidance and restrictions for how you can access the data, so knowing this at the outset can help you assess the feasibility of your project and whether it's going to work for you. As an example, with my project analyzing the CoreLogic data and the linkage to the American Community Survey, it was about a six-month process to obtain access to those data. I conducted the work at a Census Research Data Center, the secure environment for analyzing these
restricted data. The process involved being fingerprinted, giving a sworn
statement to a notary public, working in a room with no internet access, having
output reviewed for statistical disclosure risk. I would say that's at the extreme end of what's required for data linkage projects, but it is a kind of example of what might be required. So when you develop your research plan, make sure to account for this process. Data use agreements are very often required for these kinds of projects. The data use agreement is a legal agreement; the data producer has the responsibility to provide it, but as data users we also have a responsibility to find out about and complete what is needed to do the analysis. What it does is transfer the responsibility for protecting the privacy and the confidentiality from the data provider to the data user when the data user is undertaking the analysis of the data. There can be different legal requirements based upon the area of application; different laws can apply.
If you’re analyzing health data, HIPAA would probably be the main one. FERPA,
applying to education data. The Common Rule applies to human subjects research. When I was working with Census Bureau data, Title 13 applied. There are particular rules for working with IRS data, as some examples. When do you need to get a
DUA? Almost always necessary, I would say unless the data are in the public domain.
We actually had one question after this morning's lecture from somebody who was using a geographic linkage, linking to the ACF estimates at the zip code level. That's almost certainly the kind of thing that's okay, because those are published estimates that anybody can obtain and link at a geographic level. So that kind of thing would not have the sensitivity issues of doing person-level or household-level kinds of linkages. But generally the default is that you should expect to find out whether one is needed if you're working with restricted data. And I would say it's almost certainly needed for
every file that has personally identifiable information (PII), like Social
Security numbers, dates of birth, things like that, protected health information (PHI),
and education records at a minimum. These are all considered
sensitive information under these different laws. So what goes into a DUA?
It would contain the rules and restrictions on how the data products can be used and how the data can be transferred, to make sure that's done in a secure way; the rules for disclosing results and presenting the data; and what's required on-site for keeping the data secure. There are certainly, in some instances, possibilities of folks from the government agency administering a program doing visits to make sure that you're following all the procedures to use the data. And then it would lay out the penalties for non-compliance with the requirements for using the data. So, as discussed, there is responsibility on both sides for data use agreements, and most of you, I think, would be the data users. So you should contact the data provider to find out what's involved with completing a DUA for your potential project. Some of you, I think, are or might be data providers as well, or initiating the agreement, so you may already be familiar with what's required, or you can learn about what's involved with putting together a DUA if you're sharing data with others. I'll conclude with discussing informed
consent. When we conduct a survey, the respondent engaging with it, providing data on the web, filling out a paper questionnaire, or doing a telephone interview, entails explicit permission for that respondent's data to be used when they're collected. For administrative data, that's not the default. The data are obtained from an agency, usually unknowingly to the individuals or entities represented in the data, so it's recommended to obtain informed consent when linking non-survey data with survey data for research use, and it is sometimes legally required. Informed consent for research broadly would be voluntary agreement to participate in research. So the linkage aspect would be permission to link the data to the survey data for research purposes. And the motivation: as researchers and government agencies, we rely on trust from the public in how we use their data and should respect participants by giving them self-determination over how their data are used. So
informed consent is important. Now, an important point is that informed consent is something that should be accounted for in analysis. I won't go into detail about all the statistical aspects of this, but informed consent rates are pretty much never a hundred percent. Consent is usually related to some characteristics that are of interest to study, so if you ignore it you could get biased results. I think later lectures are going to discuss statistical methods for adjustment in these kinds of situations; imputation and weighting would be the classic kinds of methods.
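A minimal Python sketch of one weighting idea, inverse-probability-of-consent weighting, assuming each survey respondent has a consent indicator and a few covariates for modeling consent; the file and variable names are hypothetical, and this is only an illustration of the general approach, not the method used in the paper discussed next.

```python
# Sketch of an inverse-probability-of-consent weighting adjustment.
# File and variable names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("survey_with_consent.csv")    # hypothetical survey file
X = df[["age", "household_size", "income"]]    # covariates related to consent
consented = df["consented"].astype(int)        # 1 = consented to linkage

# Model the probability of consenting to linkage
model = LogisticRegression(max_iter=1000).fit(X, consented)
df["p_consent"] = model.predict_proba(X)[:, 1]

# Weight consenting cases by the inverse of their consent propensity; these
# weights would be combined with the survey weights in the linked analysis
linked_cases = df[df["consented"] == 1].copy()
linked_cases["consent_weight"] = 1.0 / linked_cases["p_consent"]
print(linked_cases[["p_consent", "consent_weight"]].describe())
```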
One neat example is a paper on linkage of the Child Care and Development Fund data to the NSECE. Rupa and Carolina were involved with looking at the consent decision and finding the differences between consenting and non-consenting households. So that's an example demonstrating differences in [indistinct]. [Rupa] And I'm just going to interrupt to say that what we found was that in Illinois there were no systematic differences between consenting and non-consenting households. [Zach] Okay. [Rupa] That's a big plug. (laugh) [Zach] So, I'd say we've covered the aspects of understanding administrative data. In the
next section, we'll be talking more about data linkage: what are the different analytic purposes for which you would conduct data linkage, and then assessing project feasibility as well. In terms of your project assignments, here are some
questions that you would consider for that assignment based upon today’s discussion. In
terms of administrative data what are the quality considerations that are important
for understanding your administrative data source? How will you evaluate that
data quality? And what are some of the limitations in your plan for analyzing
data quality? And then questions related to planning for the access requirements
for conducting your linkage and analyzing the data, if you are the one
conducting the linkage for the project, or the permission processes and access
requirements. Who would be the one to provide access to the different data
files? What legal agreements do you need to complete? Will the linkage be
done by you or by a different agency? And where will the analyses be conducted and how?
And then how long would the process take to account for it in your planning?
A reminder: to discuss any questions from these lectures, or if there are questions related to your projects, you can go to Canvas and register for office hours. We have some coming up tomorrow afternoon as well as on Tuesday, I believe. And here are some of the references I mentioned today. Some of these are good resources for thinking about administrative data and learning about the topic more deeply, so I recommend referring to some of these for more on some of the topics we discussed today. Okay, so we have a bit of extra time. Any questions about today's content or about the course in general? People can either unmute their microphones or type questions in the chat box. All right then. I'll stick around for a few minutes if there are any last questions, and thanks so much. We'll see you tomorrow, when Carolina will be giving a lecture.
