Every summer, the Biocomplexity Institute’s Social and Decision Analytics Division’s Data Science for the Public Good (DSPG) Young Scholars program draws university students from around the country to work together on projects that use computational expertise to address critical social issues faced by local, regional, state or federal governments. The students conduct research at the intersection of statistics, computation, and the social sciences to determine how information generated within every community can be leveraged to improve quality of life and inform public policy. The program, held at the University of Virginia’s Arlington offices, runs for 10 weeks for undergraduate interns and 11 weeks for graduate fellows who work in teams collaborating with postdoctoral associates and research faculty from the division, and project stakeholders.
The 2019 cohort conducted nine research projects, and their methodologies and discoveries will be presented at MethodSpace over the next three weeks as part of our examinations of Methods In Action. The descriptions of the projects were penned by the students themselves, and their names, mentors and sponsors appear under the DSPG logo in the text.
Burning Glass Technologies (BGT) is a Boston-headquartered labor market analytics firm that uses artificial intelligence to collect and host a massive repository of job-ad and resume data. Researchers have compared the job-ad data against federal and state surveys and have used them to track changes in job-ad skills for occupations under various economic conditions and across geographic locations. In contrast, no published research outside of BGT’s own publications uses the resume data.
Our research uses BGT resume data to explore the pathways to a non-degree credential (see note) for job seekers with less than a bachelor’s degree, a credential that will provide them with the skills necessary to secure a job in the skilled technical workforce. A job in the skilled technical workforce meets two criteria: it requires a high level of knowledge in a technical domain, and it is open to workers without a bachelor’s degree (Rothwell 2016). This brief shares our work profiling the data and deriving a maximum education variable that we can use to classify each resume into one of two categories: less than a bachelor’s degree, or bachelor’s degree and higher. We evaluated our methodology on the resume data from two metropolitan statistical areas, or MSAs, in Virginia.
The resume data are proprietary to BGT and sourced from a variety of BGT partners, including recruitment and staffing agencies, workforce agencies, and job boards. This proprietary data set contains 23 million unique resumes covering 100 million jobs over 2002–2018. The resume data are in the form of relational tables connected with a unique resume ID. The five associated tables are divided into Candidate Information (9 variables), Skills (8), Jobs (9), Education (12), and Certifications (4).
Data profiling covered five metrics: completeness, value validity, consistency, uniqueness, and duplication. This brief discusses only completeness, and only for the variables used in our research. Completeness is calculated as the percentage of observations that have values out of the number of observations that “should” have values; NA (not available) entries are not counted as valid values. The table displays our completeness results for 32 of the 42 resume variables.
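As an illustration of the completeness metric, the following is a minimal sketch rather than the code used in the project. It assumes that every row “should” have a value, and the set of strings treated as NA (`NA_MARKERS`) and the sample column are hypothetical.

```python
import math

# Hypothetical markers treated as "not available"; adjust to the data at hand.
NA_MARKERS = {"", "na", "n/a", "nan", "null", "none"}


def is_missing(value):
    """Return True when a cell holds no usable value."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in NA_MARKERS


def completeness(values):
    """Percentage of observations that hold a value (NA does not count).

    Assumption: every observation passed in is one that should have a value.
    """
    values = list(values)
    if not values:
        return 0.0
    present = sum(1 for v in values if not is_missing(v))
    return 100.0 * present / len(values)


# Made-up column of six Degree Level entries, three of them missing.
degree_level = ["16", None, "12", "NA", "14", ""]
print(completeness(degree_level))  # 50.0
```

If only a subset of rows is expected to carry a value, the denominator would be restricted to that subset before calling `completeness`.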
The Skills table was 100 percent complete for the canonicalized skill names for both MSAs and included skill cluster families and skill clusters for approximately 80 percent of the skills. In contrast, only two of the nine Education table variables were 100 percent complete in both MSAs and eight were less than 60 percent complete, the majority in the Richmond MSA.
Additionally, many of the Education table entries were not canonicalized. Unexpected strings of characters, such as phone numbers, were sometimes found in unrelated variables. Multiple degree types were sometimes combined within a single row (e.g., if someone received two degrees from the same university, a row for one resume might include “University of Virginia” under Institution and “MA#BA” or “BA#BS” under degree field). Since the focus of our research is on pathways to the skilled technical workforce, whose jobs require a high level of technical knowledge but not a bachelor’s degree or above, it was necessary to categorize each resume as having or not having a bachelor’s degree or higher. The methodology for deriving this variable is described in the next section.
Results of Profiling the BGT Resume Data for Completeness for Two Virginia MSAs (2016-2018)
Identifying Individuals’ Maximum Education
For our analysis, we were interested in distinguishing between individuals having or not having a bachelor’s degree or higher in two MSAs in Virginia. Education data were spread across two tables, Candidate Information and Education. The Candidate Information table included a variable for the number of school degrees. This variable ranged from zero to values as high as 41 and also included character values, an indication of a value validity issue (i.e., values outside the range expected for a legitimate entry).
The Education table is ordered with a line per degree that includes the unique BGT resume ID (each resume may have multiple lines in the Education table). Each educational line includes a variable for the number of years of study for each degree (e.g. “12” indicates completion of a high school degree). However, this field was 52 percent incomplete for one MSA and 42 percent for the other.
Since we were interested in determining which individuals had at least a bachelor’s degree, any individual whose number of years of study (referred to as Degree Level in the Education data table) was higher than 14, the value corresponding to an associate’s degree, was considered to have a bachelor’s degree or higher. This allowed us to classify 68 percent of the combined MSA resumes into having or not having a bachelor’s degree or higher.
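The threshold rule can be sketched as follows. This is an illustrative reading of the rule, not the project’s code: the function name, the label strings, and the choice to classify a resume from whichever of its education lines carry a numeric Degree Level are all assumptions.

```python
import math

ASSOCIATE_LEVEL = 14  # years of study corresponding to an associate's degree


def classify_by_degree_level(levels, threshold=ASSOCIATE_LEVEL):
    """Classify one resume from the Degree Level values on its education lines.

    Returns 'bachelors_or_higher', 'less_than_bachelors', or None when no
    line carries a usable numeric value (the resume stays unclassified).
    """
    numeric = []
    for raw in levels:
        try:
            value = float(raw)
        except (TypeError, ValueError):
            continue  # missing or non-numeric entry
        if math.isnan(value):
            continue
        numeric.append(value)
    if not numeric:
        return None
    if max(numeric) > threshold:
        return "bachelors_or_higher"
    return "less_than_bachelors"


print(classify_by_degree_level(["12", "16"]))   # bachelors_or_higher
print(classify_by_degree_level(["12", "14"]))   # less_than_bachelors
print(classify_by_degree_level([None, "NA"]))   # None
```

Resumes that come back as `None` are the ones the later text-based criteria would need to resolve.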
We performed exploratory analyses to increase the completeness of the derived variable. We checked three related variables: the name of the educational institution, the major (CIP code), and the degree type. From these we looked for information that could be used to develop a list of criteria for classifying the remaining resumes. Any education record that met any of the criteria below was tagged as below a bachelor’s degree.
- In the “Institution” column, we flagged:
  - Virginia community colleges (none of which offer bachelor’s degrees)
  - Trade schools (e.g., culinary schools, cosmetology colleges, and other vocational schools)
  - High schools and GED programs
- For “Major”, we flagged values that indicated partial college completion (such as “some college” or “coursework in”), as well as strings like “GED”
- For “Degree Type”, we flagged values that might indicate an associate’s degree, like “associates” and acronyms like “AA” and “AAS”, as well as degrees like “High School Diploma” or “GED”
In all of these cases, we standardized text by removing capitalization and punctuation, so that variants such as “Ph.D.” and “PHD” are treated as the same value.
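The criteria above could be applied along the following lines. This is a hedged sketch: the keyword lists contain only the examples named in the text plus a few hypothetical stand-ins, the record field names (`institution`, `major`, `degree_type`) are assumed, and a real criteria list would need to be curated by hand.

```python
import re

# Illustrative keyword lists; these are assumptions, not the project's lists.
INSTITUTION_PHRASES = ("community college", "high school", "culinary",
                       "cosmetology", "vocational")
MAJOR_PHRASES = ("some college", "coursework in")
BELOW_BACHELORS_DEGREE_TYPES = {"associates", "aa", "aas",
                                "high school diploma", "ged"}


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = (text or "").lower()
    text = re.sub(r"[^\w\s]", "", text)  # "ph.d." becomes "phd"
    return re.sub(r"\s+", " ", text).strip()


def is_below_bachelors(record):
    """Return True if an education record meets any of the flagging criteria."""
    institution = normalize(record.get("institution"))
    major = normalize(record.get("major"))
    degree_type = normalize(record.get("degree_type"))

    if any(phrase in institution for phrase in INSTITUTION_PHRASES):
        return True
    if any(phrase in major for phrase in MAJOR_PHRASES):
        return True
    # Match "ged" as a whole word so that "managed" is not flagged.
    if "ged" in institution.split() or "ged" in major.split():
        return True
    return degree_type in BELOW_BACHELORS_DEGREE_TYPES


record = {"institution": "Northern Virginia Community College",
          "major": "Business", "degree_type": "A.A.S."}
print(normalize("Ph.D.") == normalize("PHD"))  # True
print(is_below_bachelors(record))              # True
```

Matching “GED” as a whole word rather than a substring is a design choice of this sketch, made to avoid false positives in longer words.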
We then aggregated the education records for each individual to see whether the individual had any linked education record for a degree beyond an associate’s degree. With this methodology, we identified the bachelor’s degree status of 79 percent of the combined MSA resumes.
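A sketch of the aggregation step is shown below, under stated assumptions. Each education record is assumed to already carry a per-record label of "above" (beyond an associate’s degree), "below", or `None` (undetermined). Leaving a resume unclassified when it mixes "below" and undetermined records is a conservative choice made for this sketch; the text does not specify how such cases were handled.

```python
from collections import defaultdict


def aggregate_by_resume(records):
    """Roll per-record labels up to one status per resume ID.

    records: iterable of (resume_id, label) pairs, where label is
    'above', 'below', or None.
    """
    labels = defaultdict(list)
    for resume_id, label in records:
        labels[resume_id].append(label)

    status = {}
    for resume_id, seen in labels.items():
        if "above" in seen:
            # Any record beyond an associate's degree settles the question.
            status[resume_id] = "bachelors_or_higher"
        elif all(label == "below" for label in seen):
            status[resume_id] = "less_than_bachelors"
        else:
            status[resume_id] = None  # not enough information
    return status


# Made-up records for three resumes.
records = [("r1", "below"), ("r1", "above"),
           ("r2", "below"),
           ("r3", None), ("r3", "below")]
status = aggregate_by_resume(records)
classified = sum(1 for s in status.values() if s is not None)
print(status)
print(f"{100.0 * classified / len(status):.0f} percent classified")
```

The share of resumes with a non-`None` status is the kind of figure reported above as the percentage of resumes whose bachelor’s degree status could be identified.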
Conclusions and Future Research
The BGT resume data are a rich source of information on job candidates, which we began to explore in this paper. However, our specific research questions pushed us to augment the existing resume data.
These methods are tailored to our research on the skilled technical workforce in Virginia and will need to be adapted for other contexts. For instance, community colleges in other states may offer four-year degrees. Additionally, the resume data include several other tables that we have not addressed in this paper, such as job histories and certifications, which may offer their own challenges and opportunities.
We hope that our methods for the Education and Skills tables will be of use to future researchers as they take advantage of this rich data source.