Every summer, the Biocomplexity Institute’s Social and Decision Analytics Division’s Data Science for the Public Good (DSPG) Young Scholars program draws university students from around the country to work together on projects that use computational expertise to address critical social issues faced by local, regional, state or federal governments. The students conduct research at the intersection of statistics, computation, and the social sciences to determine how information generated within every community can be leveraged to improve quality of life and inform public policy. The program, held at the University of Virginia’s Arlington offices, runs for 10 weeks for undergraduate interns and 11 weeks for graduate fellows, who work in teams with postdoctoral associates and research faculty from the division and with project stakeholders.
The 2019 cohort conducted nine research projects, and their methodologies and discoveries will be presented at MethodSpace over the next three weeks as part of our examinations of Methods In Action. The descriptions of the projects were penned by the students themselves, and their names, mentors and sponsors appear under the DSPG logo in the text.
The Data Science for the Public Good program is a part of the University of Virginia’s Biocomplexity Institute and Initiative, which aims to identify, visualize, and understand the full complement of issues that impact the public good. Working with our sponsor, the Army Research Institute for the Behavioral and Social Sciences, our team utilized a data science approach to examine community embeddedness in two states, Oklahoma and Virginia, and then focused on counties surrounding an Army training installation in each state. We developed a composite index to measure and compare community embeddedness across counties.
Community embeddedness is a measure of how hard it is for someone in a community to leave it—expressing the underlying economic, social, and health factors that tie residents to their community. We created a composite index (a single value that summarizes many underlying concepts) approximating the community embeddedness in a given place.
Community Embeddedness Index Creation
The data sources used to create our community embeddedness indicator include publicly available datasets from County Health Rankings, the American Community Survey, the Annie E. Casey Foundation (Kids Count data), and state voting agencies. This feasibility study covers two states, Virginia and Oklahoma, and within them the communities surrounding two Army training posts: Fort A.P. Hill in Virginia and Fort Sill in Oklahoma. Community embeddedness is defined here using the Social Determinants of Health framework, which emphasizes:
- Neighborhood and built environment
- Health and health care
- Social and community context
- Economic stability
Based on these categories, we selected and analyzed 63 relevant variables, first to remove redundant variables and then to identify the most essential indicators in our model.
The first step was to remove variables that had over 20% of their data missing. Using a correlation matrix, we then identified sets of closely related variables, considering which were also qualitatively linked (e.g., percent that drive to work, percent that walk to work, percent that work from home), and chose the conceptually broadest variable to represent each set. After this process, 25 potential variables remained to define community embeddedness.
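As a rough illustration of this screening step, the sketch below drops variables with more than 20% missing values and then flags strongly correlated pairs for manual review. The variable names, the toy values, and the 0.8 correlation cutoff are assumptions made for the example; the choice of which variable represents a correlated set remains a judgment call, as described above.

```python
import math

def missing_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def drop_sparse(columns, max_missing=0.20):
    """Keep only variables whose missing share does not exceed max_missing."""
    return {name: vals for name, vals in columns.items()
            if missing_share(vals) <= max_missing}

def pearson(x, y):
    """Pearson correlation over the rows where both values are present."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)

def correlated_pairs(columns, threshold=0.8):
    """List variable pairs whose absolute correlation meets the threshold."""
    names = sorted(columns)
    flagged = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                flagged.append((a, b, round(r, 3)))
    return flagged

# Hypothetical county-level values; None marks a missing observation.
counties = {
    "pct_drive_alone": [80.0, 75.0, 60.0, 85.0, 70.0],
    "pct_walk":        [2.0, 4.0, 10.0, 1.0, 6.0],
    "median_income":   [52000.0, 61000.0, 48000.0, None, 75000.0],
    "sparse_variable": [None, None, 3.0, 4.0, 5.0],
}

kept = drop_sparse(counties)        # removes variables with over 20% missing
flagged = correlated_pairs(kept)    # candidate redundant pairs for review
```

Here the two commuting variables are flagged as a redundant pair, which mirrors the commuting example in the text; a person would then decide which of the two is conceptually broader.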
To construct the index, the remaining 25 indicators were analyzed using linear regression techniques, with the percentage of people who stayed within a county over the past five years serving as a proxy for community embeddedness. Two methods, best subsets and lasso regression (both of which identify the most explanatory variables in a regression model), further reduced the number of variables. Best subsets regression yielded the 10 best predictors of our proxy:
- median household income
- ratio of household income at 80th percentile compared to income at the 20th percentile
- percent of population in rural (defined as non-urban) areas
- percent of workers 16 and over who drive to work alone in a car, truck, or van
- percent of total population identifying as white (either alone or in combination with other races)
- average daily air pollution (measured in particulate matter—PM2.5)
- total population for the county
- index of residential segregation between non-white and white residents
- mean commute time for workers 16 and over
- percent who voted in the last gubernatorial election
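The idea behind best subsets regression can be sketched in a few lines: fit an ordinary least squares model for every combination of candidate predictors and keep the combination that fits best. The example below is a minimal illustration on made-up data, not the team's code. It fixes the subset size and compares residual sums of squares, which is an assumption; the source does not say which selection criterion was used, and in practice subsets of different sizes are usually compared with a penalized measure such as adjusted R-squared or BIC.

```python
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def ols_rss(rows, y):
    """Fit y on an intercept plus the given rows; return the residual sum of squares."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, t in zip(design, y))

def best_subsets(data, y, size):
    """Search every predictor subset of the given size for the lowest RSS."""
    best = None
    for subset in combinations(sorted(data), size):
        rows = list(zip(*(data[name] for name in subset)))
        rss = ols_rss(rows, y)
        if best is None or rss < best[1]:
            best = (subset, rss)
    return best

# Hypothetical standardized predictors for six counties.
toy = {
    "median_income": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "pct_rural":     [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "unrelated":     [5.0, 3.0, 6.0, 1.0, 2.0, 4.0],
}
# Toy outcome built from two of the three predictors, so the search should find them.
stayed = [3 * r - 2 * i for i, r in zip(toy["median_income"], toy["pct_rural"])]

subset, rss = best_subsets(toy, stayed, size=2)
```

An exhaustive search grows quickly: with 25 candidate variables there are 2^25 (about 33.5 million) possible subsets, which is one reason a penalized method such as lasso is often run alongside it.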
We then scaled the values for each predictor and ran a multiple linear regression, using the coefficients from this model as index weights for each of our indicators. This decision was validated by an exploration of Principal Components Analysis and Factor Analysis. These techniques are commonly used for variable selection and produced models similar to our regression, but they did not fit the data as well.
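The weighting step can be sketched as follows, assuming that "scaled" means converting each predictor to z-scores (the source does not specify the scaling method). Once predictors share a common scale, the regression coefficients are directly comparable and can serve as weights. The data and variable names are invented for illustration.

```python
import statistics

def zscore(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def regression_weights(predictors, outcome):
    """Regress the outcome on z-scored predictors and return the slope per predictor.

    Because the z-scored predictors have mean zero, centering the outcome
    absorbs the intercept, so only the slopes need to be solved for.
    """
    names = list(predictors)
    z = [zscore(predictors[name]) for name in names]
    y_mean = statistics.fmean(outcome)
    y_centered = [t - y_mean for t in outcome]
    gram = [[sum(p * q for p, q in zip(zi, zj)) for zj in z] for zi in z]
    rhs = [sum(p * t for p, t in zip(zi, y_centered)) for zi in z]
    return dict(zip(names, solve(gram, rhs)))

# Hypothetical data: income in thousands of dollars, rural share in percent.
toy_predictors = {
    "median_income": [40.0, 50.0, 60.0, 70.0, 80.0],
    "pct_rural":     [90.0, 20.0, 60.0, 40.0, 10.0],
}
# Toy proxy: percent of residents who stayed in the county.
toy_stayed = [100 - 0.5 * inc + 0.1 * rur
              for inc, rur in zip(toy_predictors["median_income"],
                                  toy_predictors["pct_rural"])]

weights = regression_weights(toy_predictors, toy_stayed)
```

In this toy example income receives a negative weight and rurality a positive one, with income carrying the larger magnitude because the weights are expressed per standard deviation of each predictor.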
Finally, we scaled the embeddedness construct to create our index. The weights and relationship of the variables are presented in the mosaic plot.
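To make the final step concrete, the sketch below combines z-scored indicators into a weighted sum and rescales the result. The 0 to 100 min-max range is an assumption chosen for readability, since the text does not state which scale was used, and the weights and county values are hypothetical.

```python
import statistics

def zscore(values):
    """Standardize values to mean 0 and population standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def embeddedness_index(indicators, weights):
    """Weighted sum of z-scored indicators, rescaled so scores span 0 to 100."""
    z = {name: zscore(indicators[name]) for name in weights}
    n_counties = len(next(iter(z.values())))
    raw = [sum(weights[name] * z[name][i] for name in weights)
           for i in range(n_counties)]
    lo, hi = min(raw), max(raw)
    # Min-max rescaling (an assumed choice): lowest county -> 0, highest -> 100.
    return [100 * ((r - lo) / (hi - lo)) for r in raw]

# Hypothetical indicator values for four counties and illustrative weights.
toy_indicators = {
    "median_income": [40.0, 50.0, 60.0, 70.0],
    "pct_rural":     [80.0, 60.0, 30.0, 10.0],
}
toy_weights = {"median_income": -0.6, "pct_rural": 0.4}

index = embeddedness_index(toy_indicators, toy_weights)
```

With these toy inputs, the poorest and most rural county receives the highest score and the wealthiest, least rural county the lowest, matching the direction of the weights.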
The single largest contributor is median income, which contributes negatively to the index (i.e., counties with higher incomes tend to have lower levels of community embeddedness). Other factors that contribute to embeddedness are illustrative of the community’s setting (rural indicators: rurality, driving to work; urban indicators: air pollution, segregation). Finally, there are factors that describe the people within a community, which may influence their ability and willingness to leave (income, population, percent white, commute time, voter turnout).
Using the Index
To look at how the presence of an Army post might be linked to community embeddedness, we focused on two counties containing installations: Caroline County, Virginia (Fort A.P. Hill) and Comanche County, Oklahoma (Fort Sill). These locations were selected because the Army posts have similar training missions and their home counties, as well as all surrounding counties, share similar rural characteristics. Comparisons can be drawn between the counties that possess Army posts and those that do not.
Comanche County, Oklahoma has lower community embeddedness than its neighboring counties. Caroline County, Virginia has embeddedness similar to that of its surrounding counties. Our research demonstrates the feasibility of constructing a community embeddedness index and is a first step toward measuring how the presence of an Army post might relate to embeddedness. Future analyses should also explore other reflective measures of community embeddedness beyond migration patterns, as the value of an index is typically tied to its ability to predict multiple outcomes (e.g., number of public-private community partnerships, election turnout).
Examining community embeddedness in each state also provides interesting results. For example, Southwest Virginia has high embeddedness index values, which is not surprising given how community embeddedness is defined here. This part of Virginia is in the heart of Appalachia, and the community embeddedness index reflects the lower incomes in this part of the state.