
Applied clustering: identifying objective niches of Apple App Store mobile applications


Introduction

When there are too many applications, that in itself becomes a problem.

Growing year after year, the mobile application market is expected to reach $189 billion by 2020. In the Apple App Store alone, the number of mobile applications has exceeded 3 million, of which 783,000 are games and the rest are non-game applications. And while all games are split into 19 gaming categories (an average of 41,000 games per category), the 2.3 million non-game applications are squeezed into just 24 categories (almost 96,000 applications per category on average).

User problem. There are a lot of applications, but few categories.

On the one hand, developers are often forced to place their applications in categories that only very roughly describe their functionality; on the other hand, users have to fit their needs into a limited set of static categories that have existed in app stores in almost unchanged form since their opening. In such a situation it is very easy to overlook, miss, or simply never find the application you need. App Store search is not much help here, for several reasons: it only looks at the app name and the keywords (no more than 100 characters) that the developer had to come up with when publishing the app. And if you think about how a person learns about any new concept or phenomenon, it becomes clear that a new, interesting application that has no analogues and opens a new niche has very little chance of success. After all, a person's needs arise only in relation to concepts or phenomena already familiar to them. Someone who does not know that aircraft and air travel exist will never think about the need to fly somewhere.

Developer problem. Developers are forced to spend huge resources just to tell people about a new product.

Mobile apps are often venture projects, and the conservatism of app stores makes investment activity harder. At the moment, investors have no tool for objectively analyzing mobile trends and spotting, at an early stage, the projects that are just beginning to form new directions in mobile application development.

Investor problem. It is impossible to systematically find fresh interesting projects among 3 million applications.

Formulation of the problem

The task is to distribute applications into objective groups that reflect the real essence and features of the applications. The source material is the text of application descriptions, since it is in the descriptions that developers try to fully reveal the idea and advantages of a product. As input we have about 20 thousand descriptions of iOS applications, fairly evenly distributed across 20 categories of the US Apple App Store. We need to identify the objective categories of applications and their optimal number.

The first step in any text-analysis task is preprocessing. So-called stop words, insignificant symbols, and most punctuation marks should be removed from the text. Stop words can be either general, independent of the subject of the text corpus (prepositions, participles, interjections, numerals, particles), or domain-specific (app, http, www, email, ...). All remaining words must then be reduced to their base form using lemmatization or stemming algorithms.
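As an illustration, here is a minimal preprocessing sketch in Python with NLTK; the domain-specific stop-word list and the sample description are assumptions made up for the example, not the exact pipeline used in this project.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

GENERAL_STOPWORDS = set(stopwords.words("english"))
DOMAIN_STOPWORDS = {"app", "application", "http", "www", "email"}  # assumed examples
lemmatizer = WordNetLemmatizer()

def preprocess(description):
    # keep only alphabetic tokens, lowercased
    tokens = re.findall(r"[a-z]+", description.lower())
    # drop general and domain-specific stop words, reduce the rest to base form
    return [
        lemmatizer.lemmatize(token)
        for token in tokens
        if token not in GENERAL_STOPWORDS and token not in DOMAIN_STOPWORDS
    ]

print(preprocess("Learn Spanish with our app: lessons, dictionaries and more!"))
```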

Solution methods

To solve the problem, we need to understand which of the applications already hosted in the stores are closest to each other. In other words, this is a clustering task, and we must figure out how to cluster application description texts.

Clustering algorithms such as k-means know nothing about texts and expect an array of numerical vectors as input. So we have to turn each text into a vector. Let's consider the methods that allow us to do this.

TF-IDF


TF-IDF is a statistic that is used to assess the importance of a particular word in the context of describing each application.

TF, or term frequency, is the ratio of the number of occurrences of a particular word to the total number of words in the text under study. This indicator reflects the importance of a word within a single description.
IDF, or inverse document frequency, is the inverse of the frequency with which a particular word appears across the documents of a text corpus. This indicator reduces the weight of the most widely used words (prepositions, conjunctions, general terms and concepts).

The TF-IDF score of a word is higher the more frequently the word is used in a specific text and the more rarely it appears in the other documents.

The result of the algorithm is a vector of dimension N, where N is the number of unique words in the text corpus. The dimension of a vector for typical texts is on the order of 50,000-100,000 words. For further analysis, dimensionality reduction is required.
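A minimal sketch of computing TF-IDF vectors with scikit-learn; the toy descriptions below are assumptions for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "learn spanish vocabulary lesson grammar",
    "offline translator dictionary spanish english",
    "photo editor filter collage effect",
]

# in a real pipeline min_df / max_df would be tuned to trim very rare and very common words
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(descriptions)   # sparse matrix: documents x unique words

print(tfidf_matrix.shape)   # (number of documents, number of unique words)
```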

Word2Vec, GloVe, FastText

These are algorithms for representing the words of a text corpus in a vector space. The dimension of that space does not exceed a few hundred.

Simplified, the essence of the method is as follows: the algorithm tries to predict the current word from the words surrounding it. As a result, the vectors of related words end up quite close to each other in the space. For example, for the word king, the closest words are queen and princess.

To obtain a vector for an entire text (for example, an application description), the word vectors are usually summed. Hence the obvious drawback: the resulting vector is polluted by the vectors of words that are only weakly related to the meaning of the text. Representing words as vectors works well for individual words, satisfactorily for phrases, poorly for sentences, and not at all for long texts. To improve the vector representation of texts, you need to carefully clean the analyzed text of insignificant words or use a weighted sum of the vectors, and the choice of weighting scheme is far from always obvious.
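A hedged sketch of this summing/averaging step with gensim; the toy corpus, the model parameters, and the helper function are assumptions, not the pipeline actually used here.

```python
import numpy as np
from gensim.models import Word2Vec

# toy corpus of already preprocessed descriptions
corpus = [
    ["learn", "spanish", "vocabulary", "lesson"],
    ["offline", "translator", "dictionary", "spanish", "english"],
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, epochs=50)

def description_vector(tokens, model):
    # average the vectors of known words; a TF-IDF-weighted sum is a common refinement
    vectors = [model.wv[token] for token in tokens if token in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

print(description_vector(["learn", "spanish", "lesson"], model).shape)   # (100,)
```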

Another drawback of the method is that the quality of the model depends on the text corpus used for training. For the Russian language, to train a model properly, especially for a specific domain, you need to collect a sufficiently large thematic text corpus, as a rule more than 10 million words.

Sparse autoencoder


An autoencoder is an unsupervised learning algorithm that uses a neural network trained so that, for an input feature vector X, the network output Y reproduces the input, i.e. Y = X.
An example of an autoencoder architecture is shown in the figure.

Moreover, special conditions are imposed on the network architecture:

  • The number of neurons in the hidden layer L2 must be less than the dimension of the input data (as in the figure);
  • The activation of neurons in the hidden layer should be sparse (the number of inactive neurons in the hidden layer should significantly exceed the number of active ones). As a rule, the proportion of active neurons does not exceed 5%.

The advantage of the method is that the resulting vector (usually the output of the hidden layer L2) conveys quite accurately the meaning of the text whose features were fed to the input. The disadvantage: training again requires a large, high-quality text corpus.
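For illustration, here is a sketch of a sparse autoencoder in Keras that satisfies both conditions above; the layer sizes and the L1 sparsity penalty are assumed values, not the configuration used in this project.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

input_dim = 2000      # e.g. the size of a trimmed TF-IDF vocabulary (assumed)
hidden_dim = 128      # the compressed representation we want to extract (assumed)

inputs = keras.Input(shape=(input_dim,))
# hidden layer is smaller than the input and carries an L1 activity penalty,
# which pushes most activations toward zero (sparsity)
hidden = layers.Dense(hidden_dim, activation="relu",
                      activity_regularizer=regularizers.l1(1e-5))(inputs)
outputs = layers.Dense(input_dim, activation="sigmoid")(hidden)   # reconstruct the input

autoencoder = keras.Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")

# train the network to reproduce its own input: Y = X
X = np.random.rand(500, input_dim).astype("float32")   # placeholder for real text features
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

# the hidden-layer output is the compact representation of the text
encoder = keras.Model(inputs, hidden)
print(encoder.predict(X[:2]).shape)   # (2, 128)
```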
In the process of analyzing the methods described above, we came to the conclusion that none of them can solve the problem with an acceptable quality. As a result, we applied a slightly different approach, the description of which is beyond the scope of this article.

So. We got the vector from the description text. What’s next?

Unfortunately, the size of this vector is several hundred values. Trying to do clustering on vectors of this dimension will take a very long time, even taking into account the algorithms that support parallel operation. We need to look for ways to reduce the dimension.

PCA

Principal component analysis is a way to reduce the dimensionality of data with minimal loss of information. The algorithm removes the redundancy and correlation in the array of input vectors by choosing optimal projections. As a result, the output is a set of the most significant, independent components. The main disadvantage of PCA is that the principal projections are found by constructing a covariance matrix whose size is directly proportional to the dimensionality of the input data. Because of this, for very high-dimensional data, computing the eigenvectors may be infeasible.
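A minimal PCA sketch with scikit-learn; the target of 50 components and the random placeholder data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(20000, 300)     # placeholder for ~20k description vectors
pca = PCA(n_components=50)         # keep the 50 most significant independent components
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                          # (20000, 50)
print(pca.explained_variance_ratio_.sum())      # share of the original variance retained
```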

Truncated SVD

Singular value decomposition is a factorization of the matrix of text features.
The initial matrix is decomposed into several components whose physical meaning is a sequence of linear operators that rotate and scale the initial vectors. The components of the singular value decomposition clearly show the geometric changes that occur when the data is mapped into a vector space of a different dimension.
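A sketch of truncated SVD with scikit-learn, which conveniently works directly on a sparse TF-IDF matrix; the toy descriptions and the number of components are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "learn spanish vocabulary lesson",
    "offline translator dictionary english",
    "photo editor filter collage",
]
tfidf = TfidfVectorizer().fit_transform(descriptions)   # sparse matrix of text features

svd = TruncatedSVD(n_components=2)   # works on sparse input without densifying it
X_reduced = svd.fit_transform(tfidf)
print(X_reduced.shape)   # (3, 2)
```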

Self-organizing map

The self-organizing map of Kohonen is a type of neural network belonging to the family of unsupervised learning algorithms. The main goal of the algorithm is to find hidden patterns in the data by reducing the dimension of the original space.

An important property of Kohonen maps is that they construct a mapping into a low-dimensional space (usually two-dimensional) in such a way that the topology of the original space is preserved. That is, the obtained projections can be compared with each other, the distance between them can be found, etc.

The advantages of Kohonen maps are resistance to noisy data and fast learning. The disadvantage is that the final result of neural networks depends on the initial settings of the network.
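A sketch of a Kohonen map using the third-party MiniSom package (pip install minisom); the grid size, training length, and placeholder data are assumptions.

```python
import numpy as np
from minisom import MiniSom

X = np.random.rand(1000, 300)   # placeholder for description vectors
som = MiniSom(x=20, y=20, input_len=300, sigma=1.0, learning_rate=0.5, random_seed=42)
som.random_weights_init(X)
som.train_random(X, num_iteration=5000)

# each vector is mapped to the coordinates of its best matching unit on the 2D grid,
# so the projections can be compared and distances between them measured
projections = np.array([som.winner(v) for v in X])
print(projections.shape)   # (1000, 2)
```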

t-SNE

This is a very popular algorithm that also reduces data dimensionality. It can collapse hundreds of dimensions to just two while preserving an important relationship in the data: the closer objects are in the original space, the smaller the distance between them in the reduced-dimensional space. t-SNE performs well on small and medium real-world datasets and does not require much hyperparameter tuning.

For our dimensionality reduction problem, the t-SNE 2D algorithm showed the best results.
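A minimal sketch of the 2D projection with scikit-learn's t-SNE; the placeholder data and perplexity value are assumptions.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(2000, 300)    # placeholder for description vectors
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)     # two coordinates per application, ready for plotting
print(X_2d.shape)                # (2000, 2)
```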

The result of the t-SNE algorithm is shown in the following picture.

The image clearly shows clusters of applications, and the outlines of individual groups can even be made out. Now is the time to do the clustering. But how many clusters should we choose? 30, 50, or 100?
Unfortunately, this question has no definitive answer in cluster analysis, and we must look for one based on subjective factors. If we select too few clusters, we lose selectivity; if there are too many, the ability to generalize decreases. We need to find a compromise.

Choosing the number of clusters

There are several approximate methods for finding this value. One of them, perhaps the most understandable and intuitive, is the so-called elbow method. Its essence is to plot how the quality of the partition (for example, the within-cluster variance) depends on the number of clusters k.

At first, adding new clusters helps the model a lot, but from a certain point onward, increasing the number of clusters stops improving the quality of the partition significantly. That number can then, conditionally, be called optimal: conditionally, because the quality of the resulting model still needs to be checked visually. In our case, judging by the picture above, the optimal k is 40-50, but when we inspected the resulting clusters, many applications with clearly different meanings still ended up in the same clusters. Therefore, we decided to increase k to 100.
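A sketch of how such an elbow curve can be plotted; the range of k and the placeholder 2D points are assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

X_2d = np.random.rand(5000, 2)   # placeholder for the 2D t-SNE projections

ks = range(10, 110, 10)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_2d)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()   # the "elbow" of this curve suggests a reasonable k
```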

So we run k-means clustering with k = 100 and look at the application descriptions in light of the resulting division into clusters.
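A minimal sketch of this step with scikit-learn; the placeholder 2D points stand in for the real t-SNE projections of the description vectors.

```python
import numpy as np
from sklearn.cluster import KMeans

X_2d = np.random.rand(5000, 2)      # placeholder for the 2D t-SNE projections
kmeans = KMeans(n_clusters=100, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_2d)   # cluster index (0..99) for every application
print(labels[:10])
```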

In the picture we see colored and numbered cluster groups. What can be spotted right away? Some clusters sit on their own, with no "neighbors" nearby. But there are also clusters with a pronounced shared border. For example, in the highlighted part of the image we can see clusters 0 and 60, which split a large group of applications in half. We assume that here, as a result of clustering, two objective categories have formed.

Let’s see what applications fell into these two clusters and try to understand whether we have achieved our goal or not. First, let’s build histograms of the market categories of applications that are located in clusters 0 and 60.

Now let’s take a look at a few applications and the top 50 words of application descriptions in each of the clusters.

cluster 0
Babbel – Learn Italian
Innovative 101 Learn Languages
Learn Danish: Language Course
Page: English Grammar Checker
Oxford Spanish Dictionary

Top 50 words: word phrase learn language speak english vocabulary dictionary spanish lessons native grammar audio babbel sentence speech pronunciation travel sign search help chinese course practice voice nemo translate study rhyme write italian term phrasebook hear time french speakers asl feature odyssey korean categories like record review spell read period text japanese

cluster 60
Live Translator HQ
Offline Translator Pro 8 lang
Translator with Speech Pro
Spanish Dictionary +
Arabic Dictionary Elite
Top 50 words: word english translate dictionary languages text language translation phrase spanish translator voice french chinese speech translations pronunciation japanese portuguese italian german speak russian arabic offline sentence search learn korean dutch support danish polish turkish dictionaries swedish norwegian finnish czech greek definitions hebrew romanian hindi vocabulary internet thai hungarian connection catalan

While it is rather difficult to determine the nature of the application based on the top 50 words, it is quite easy to understand from the names that cluster 0 consists of applications mainly intended for learning languages, and cluster 60 consists mainly of translators.

Thus, if we take the top 3 market categories from the histogram of a cluster and combine them, we will get the name of the objective pseudo-category corresponding to this cluster. For example, it will look like this:

  • cluster 0    – education_travel_reference
  • cluster 60 – reference_travel_books

The main market category comes first, followed by clarifying categories that give it an additional semantic shade.

To assess the quality of the split and see how the market categories are distributed across the clusters, there is another visualization method: a heatmap. It shows which market categories, and how many applications from each, fall into every cluster. The heatmap also lets you see the hierarchy of clusters, i.e. how the clusters relate to each other.

This is how it looks in our case.

Here you can see both independent clusters and entire groups of closely related clusters. The cluster groups correspond to the main market categories with their different meanings. For example, the Entertainment group in the lower right corner has additional, clarifying categories: Photo & Video, Shopping, Music, Lifestyle.
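For reference, such a heatmap with a cluster hierarchy can be drawn, for example, with seaborn's clustermap; the cross-tabulation of clusters versus market categories below is toy data standing in for the real counts.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# rows: clusters, columns: market categories, values: number of apps (toy data)
counts = pd.DataFrame(
    np.random.randint(0, 50, size=(100, 20)),
    index=[f"cluster {i}" for i in range(100)],
    columns=[f"category {j}" for j in range(20)],
)

# clustermap draws the heatmap and reorders rows/columns by hierarchical clustering,
# which makes groups of closely related clusters visible
sns.clustermap(counts, cmap="viridis", figsize=(12, 18))
plt.show()
```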

Conclusions

Using various text-analysis methods, we have created a tool that allows us to systematically track the state of the mobile application market, quickly identify new niches, and even spot individual applications that cannot be assigned to any existing category offered by the app stores.
