In part one of this series we introduced the topic of automated image tagging and showed how cloud vision APIs such as Clarifai can be used to classify images into different categories. We showed examples of SAGE images and the tags assigned by different cloud vision APIs, then discussed use cases for this innovative technology—primarily in discoverability and accessibility.
In this follow-on post, we focus on data analysis and specifically co-occurrence networks. By way of example we present a co-occurrence network derived from Clarifai image tags, which represents a kind of mental model of the SAGE journal images we processed. The following image is a visualization of the co-occurrence network that we created:
The article starts with a short introduction to co-occurrence networks—what they are, and what you can learn from them. We'll then look at how the visualization was created and what characteristics it has, and discuss what it tells us about SAGE images and the underlying image tagging techniques.
Data Analysis: Co-occurrence Networks
The visualization above shows co-occurrences between tags assigned to SAGE journal images. A co-occurrence represents, simply put, a link between two things that appear together in some context. For example, if two terms—say Romeo and Juliet, or flour and eggs—appear together in the same sentence, you can infer that there is a relationship between them.
A co-occurrence network is a collection of concepts and relationships within some scope, such as a body of text, or in our case SAGE journal image tags.
The exact nature of a co-occurrence relationship is not known; you just know that there is a link. As such, co-occurrences are useful for exploring potential relationships and developing new insights into known concepts. They are imprecise, but quick and easy to determine compared to more advanced techniques that attempt to extract knowledge semantics from images or text.
We decided to look at co-occurrences between image tags to attempt to extract a rudimentary “mental model” of SAGE journal topics, based on the images published in journal articles.
As discussed in the previous post, image tags from the Clarifai API provide a richer source of concepts compared to other sources such as the Google Cloud Vision API, so we decided to focus on co-occurrences between Clarifai tags.
Clarifai returns a confidence score for each tag, per image. To filter out noise we set a threshold score of 0.95, meaning that the minimum acceptable confidence level was 95 percent. This helped to eliminate spurious concepts from the results and led to a cleaner and more understandable mental model.
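The threshold step can be sketched as a simple filter. This is illustrative only: the tag structure below is a simplified `(name, confidence)` pair, not the exact shape of the Clarifai API response.

```python
# Minimal sketch of confidence filtering, assuming a simplified
# Clarifai-style response where each tag is a (name, confidence) pair.
# The 0.95 threshold matches the one described in the post.

CONFIDENCE_THRESHOLD = 0.95

def filter_tags(tags, threshold=CONFIDENCE_THRESHOLD):
    """Keep only tag names whose confidence meets the threshold."""
    return [name for name, confidence in tags if confidence >= threshold]

# Example: a spurious low-confidence concept is dropped.
tags = [("medicine", 0.99), ("science", 0.97), ("furniture", 0.41)]
print(filter_tags(tags))  # ['medicine', 'science']
```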
We counted every pair of tags assigned to the same image as one co-occurrence between the corresponding concepts.
To detect the co-occurrences, we wrote a simple Python script. The script’s job was to extract co-occurrences from the Clarifai responses, then collect these together into a single dataset of all known concept co-occurrences, with a count per co-occurrence. This was the data for our mental model.
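The core of such a script can be sketched in a few lines: each image contributes one co-occurrence for every unordered pair of its tags, and a `Counter` accumulates the counts across all images. This is a minimal reconstruction of the approach described above, not the actual script; the sample data is made up.

```python
# Sketch of co-occurrence extraction: count unordered tag pairs per image.
from collections import Counter
from itertools import combinations

def count_cooccurrences(images):
    """images: iterable of tag lists, one list per image."""
    counts = Counter()
    for tags in images:
        # sorted() makes pairs order-independent, so ('eggs', 'flour')
        # and ('flour', 'eggs') are counted together.
        counts.update(combinations(sorted(set(tags)), 2))
    return counts

images = [
    ["medicine", "science", "health"],
    ["medicine", "science"],
]
counts = count_cooccurrences(images)
print(counts[("medicine", "science")])  # 2
```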
Building the Data Visualization
We fed the dataset of co-occurrences and counts into a tool called Gephi to create the kind of static image of 'nodes' and 'edges' that is common in network visualization. Gephi is a network visualization and exploratory data analysis tool, ideally suited to an investigation into relationships between image concepts.
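Gephi can import a weighted edge list as a CSV file with `Source`, `Target` and `Weight` columns, so handing the co-occurrence counts to Gephi can be as simple as writing them out in that shape. The export function below is a hypothetical sketch of that step; the counts dictionary is illustrative.

```python
# Hypothetical export step: write co-occurrence counts as a Gephi-style
# edge list CSV (Source, Target, Weight columns).
import csv

def write_edge_list(counts, path):
    """counts: mapping of (source, target) tag pairs to co-occurrence counts."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Weight"])
        for (source, target), weight in sorted(counts.items()):
            writer.writerow([source, target, weight])

write_edge_list({("medicine", "science"): 42}, "cooccurrences.csv")
```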
Gephi uses circles to display nodes in the network, and these represent the different tags or concepts detected in the images. Gephi draws connections between nodes that correspond to the edges in the network, in this case representing co-occurrence relationships.
To show the relative popularity of concepts and their co-occurrences, we scaled the nodes and edges in the visualization: bigger circles represent more popular topics, and wider connections represent more frequent co-occurrences. This highlights the most topical concepts and relationships for SAGE journal images.
The layout of the visualization is produced by an algorithm called Force Atlas 2, which pulls related concepts closer together and highlights significant or influential nodes.
The colors show potential communities or clusters of terms, detected based on the structure of the network using a community detection algorithm called the Louvain method. This algorithm finds coherent groups of concepts in the network and is useful for distinguishing different aspects of the data.
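The same community detection can also be run outside Gephi: the `networkx` library ships a Louvain implementation that accepts the weighted co-occurrence edges directly. The toy graph below (two disconnected triangles of made-up tags) is purely illustrative; real results depend on the data and the random seed.

```python
# Sketch of Louvain community detection on a weighted co-occurrence
# graph, using networkx instead of Gephi. Requires networkx >= 2.8.
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([
    ("medicine", "health", 5), ("medicine", "anatomy", 4),
    ("health", "anatomy", 3),
    ("business", "finance", 5), ("business", "economy", 4),
    ("finance", "economy", 3),
])

# Edge weights (co-occurrence counts) inform the modularity optimization.
communities = nx.community.louvain_communities(G, weight="weight", seed=1)
print(sorted(sorted(c) for c in communities))
# Two clusters: the medical terms and the business terms.
```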
In this case, the network layout and communities give strong indications of different classes of image used in SAGE journals.
What can we learn from the visualization of our mental model?
First, there are a handful of very popular concepts that appear in many images and are connected to many other concepts, such as Education, Medicine, Science, and Business. These terms correspond nicely with SAGE's focus on scholarly publishing and suggest that the core concepts in the mental model are sound.
The popularity of the Medicine topic is a little surprising, however, as this is a growth area for SAGE. The disproportionate popularity is probably due to higher tagging accuracy for medical images, which are quite visually distinct, perhaps combined with a greater frequency of images in medical journals.