ANALYZING SPOTIFY DATA

EXPLORING THE POSSIBILITIES OF USER DATA FROM A SCIENTIFIC AND BUSINESS PERSPECTIVE

By Jeroen van den Hoven

Supervised by Sandjai Bhulai

VU University, August 2015.

SUMMARY

Spotify is the largest music streaming service available. The company started in 2006, at a time when piracy caused considerable losses to the music industry. In January 2015 it had 60 million users in total, of which 15 million were premium users (1), and these numbers seem to be increasing. Spotify offers free streaming of music to its users, though one can purchase a premium membership for added benefits, such as no advertisements and being able to listen to music offline.

The large number of users and content of Spotify create a large database of users and songs that users listened to that could hold interesting patterns and information for related companies, such as Spotify themselves, record companies or radio stations. The dataset in question has been provided by , so we first look for general applications of the data and then focus on possibilities that will also be useful to , but will also be interesting from a scientific perspective.

By performing some statistics on the entire dataset, we try to determine the worth of the Spotify data for both and for scientific purposes. We will answer a few relatively simple questions regarding interesting patterns found in the data and try to formulate a good model that can be used with this data. After that, we will try to overcome the problems that arise when applying our model of choice to this dataset and deliver a way to create such a model for this database.

In the end we decided to try to perform a clustering on this dataset. This presented some challenges, such as a dataset of mixed variables, containing both continuous and nominal variables. Because we did not want to rely on basic techniques to solve this, we looked further for solutions and found two possible candidates: the cluster ensemble approach and Gower’s distance metric. The metric used to evaluate whether or not a clustering was good was the cophenetic correlation coefficient.

After some trials, the cluster ensemble approach appeared to be an ineffective way of tackling the issue of mixed variables, since it resulted in clusterings with poor fitness, with a maximum of 0.4 on a scale of 0 to 1. We expect this has something to do with the amount of information that is lost in the process of using the cluster ensemble approach.

Gower’s distance metric performed considerably better, with an initial fitness of 0.68. After some optimization we ended with a fitness of 0.94. We decided to base our final clustering on this model. However, upon analysing this final clustering we found that the set of weights used for this fitness resulted in the dataset only being split on two variables. It seems that, though Gower’s metric does have the potential to reach high fitness, it may have a tendency to be biased, depending on the chosen weights and the underlying data.

In the end we will also propose a few changes to the methods that could be used to improve the effectiveness of both clustering methods.

INTRODUCTION

The large number of users and content of Spotify create a large database of users and songs that users listened to that could hold interesting patterns and information for related companies, such as Spotify themselves, record companies or radio stations. The dataset in question has been provided by , so we will first be looking for general applications of the data and then focus on possibilities that will also be useful to , but will also be interesting from a scientific perspective.

This paper will try to determine the worth of the Spotify data. This will be done according to the following procedure:

• Exploration of the data.
• Evaluation of the usefulness of different variables in different models.
• Choosing the most promising model based on this evaluation, with proper argumentation. This argumentation will be based on how interesting the problem is from a scientific point of view and a business view, coupled with how likely it is that the model can actually be built.
• Building a prototype of said model.

Combining all of the above leads to the following research question:

What are the possibilities of the Spotify dataset for and which ones will be most interesting, both from a business perspective and a scientific perspective?

LITERATURE AND BACKGROUND

In this section we will be discussing some of the necessary definitions, techniques, and background information for this paper.

TERMS

Before we get to the technical details of the paper, it will be useful to define some of the terms that will be used regularly:

• Instance: One measurement: one song that was listened to at some time by someone. This corresponds to one row in the database.
• Nominal variable: A variable that takes values from a finite range of possibilities, for instance gender or device type.
• Continuous variable: A variable that takes values from a range on the real axis.

CLUSTERING METHODS

In the end this paper will focus mainly on clustering the Spotify data. To do this we need a good clustering method. There is a large selection to choose from, starting with the choice between hierarchal and non-hierarchal clustering, with different clustering methods in both categories. Another important question is related to the type of data that we have available. We will go into more detail on this later on, but for now the most important piece of information is that the dataset has mixed variables, both nominal and continuous. This creates a problem, since most distance metrics and some clustering methods do not work well with this type of dataset. Simply replacing the nominal variables with dummy binary variables would normally be an option; however, there is one interesting variable that has almost 700 different values, which would probably lead to a significant decrease in performance if it were converted to binary variables. To solve this problem we will be looking at two methods: Gower’s metric and a cluster ensemble approach (2), which will be explained below.

NON-HIERARCHAL CLUSTERING

For non-hierarchal clustering we will be looking at k-means. K-means clustering is one of the better-known non-hierarchal clustering methods that chooses K centres for K clusters and assigns each instance to the closest cluster. It then recomputes the centre of each cluster by taking the average for each variable of all instances that are part of the cluster and repeats the process. So, in essence we will do the following:

1. Initialise K vectors Mi, i ∈ {1, 2, …, K}, representing our K clusters. This can be done at random or by choosing K different instances from our original dataset.
2. Until we achieve convergence, do the following:
   2.1 Assign each instance Xi to its nearest cluster centre.
   2.2 Recompute the cluster centre for each cluster by averaging all instances in that cluster.
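The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this paper: it assumes purely continuous variables, squared Euclidean distance, and initialisation by picking K different instances; the names `k_means` and `squared_distance` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(instances, k, max_iter=100, seed=0):
    """Cluster a list of numeric tuples into k groups (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: initialise the K centres with K different instances.
    centres = [list(p) for p in rng.sample(instances, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2.1: assign each instance to its nearest cluster centre.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(x, centres[j]))
            for x in instances
        ]
        if new_assignment == assignment:
            break  # convergence: no instance changed cluster
        assignment = new_assignment
        # Step 2.2: recompute each centre as the per-variable average.
        for j in range(k):
            members = [x for x, a in zip(instances, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centres
```

Because the centres are averages, this sketch only makes sense for continuous variables, which is exactly why the mixed dataset needs a different treatment.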

The assignment of instances to clusters is done in the following way: (3)

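The assignment indicator for instance Xi and cluster j takes the value

$$
\begin{cases}
1 & \text{if } d(X_i, M_j) = \min_{k \in \{1,\dots,K\}} d(X_i, M_k), \\
0 & \text{otherwise.}
\end{cases}
$$

For Gower’s metric, the distance between two observations is the weighted average of the per-variable differences over the N variables:

$$
S_{ij} = \frac{\sum_{k=1}^{N} w_k \, S_{ijk}}{\sum_{k=1}^{N} w_k}
$$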

Where:

• Sij := the distance between observations Xi and Xj.
• wk := the weight for variable k.
• Sijk := the difference between Xik and Xjk.

The original formula uses pair-specific weights Wijk in place of Wk, but we will be using the same weights for each pair of observations.
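As an illustration of how such a weighted distance can be computed for mixed variables, consider the sketch below. The per-variable differences follow the usual convention for Gower’s metric, which is an assumption here since the text does not spell it out: 0 or 1 for nominal variables, and the absolute difference divided by the variable’s range for continuous variables. The function name and the `kinds`, `ranges`, and `weights` parameters are hypothetical.

```python
def gower_distance(x_i, x_j, kinds, ranges, weights):
    """Weighted Gower distance between two observations.

    kinds[k]   -- "nominal" or "continuous" for variable k
    ranges[k]  -- max minus min of variable k (ignored for nominal variables)
    weights[k] -- the weight w_k, shared by every pair of observations
    """
    total = 0.0
    for a, b, kind, value_range, w in zip(x_i, x_j, kinds, ranges, weights):
        if kind == "nominal":
            # Assumed convention: nominal values either match or they do not.
            s_ijk = 0.0 if a == b else 1.0
        else:
            # Assumed convention: scale the difference to the interval [0, 1].
            s_ijk = abs(a - b) / value_range if value_range else 0.0
        total += w * s_ijk
    return total / sum(weights)
```

For example, with a device type and a birth year whose range is 40 years, `gower_distance(("mobile", 1990), ("desktop", 1970), ["nominal", "continuous"], [None, 40.0], [1.0, 1.0])` averages a difference of 1 and a difference of 0.5. Raising the weight of one variable pulls the distance towards that variable’s difference, which is how a set of weights can end up letting only a few variables drive the clustering.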

CLUSTER ENSEMBLE APPROACH

The basic principle of the cluster ensemble approach is a divide and conquer technique: it divides the dataset into two datasets, one with all the nominal variables and one with all the continuous variables. The individual datasets are then clustered like normal datasets, which is possible since they only contain variables of one type. Once both datasets have been clustered, the results are combined into a new dataset of nominal variables, which is clustered again, resulting in the final clustering (6). The advantage of this technique is that one can use existing techniques to cluster the separate datasets and the final dataset.
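The combination step can be illustrated as follows. This sketch only shows how two sets of cluster labels become a new nominal dataset and how a simple matching distance on that dataset could feed the final clustering; the choice of distance and the function names are assumptions, not the method of reference (6).

```python
def combine_clusterings(nominal_labels, continuous_labels):
    """Build the new nominal dataset: one row of cluster labels per instance."""
    return list(zip(nominal_labels, continuous_labels))


def matching_distance(row_a, row_b):
    """Fraction of label columns on which two instances disagree."""
    return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)


# Labels as they might come out of the two separate clusterings (made-up values).
combined = combine_clusterings([0, 0, 1], [1, 1, 0])
```

Here the first two instances share both labels and are at distance 0, while the third disagrees with them on both columns. Note that each instance is reduced to just two labels at this point, so any detail within a cluster is no longer visible to the final clustering.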

COPHENETIC CORRELATION COEFFICIENT

One problem of clustering with this dataset is that there are no predefined clusters. This makes the process of determining whether or not a clustering is a good fit difficult. In order to be able to distinguish a good clustering from a bad clustering, we need a different evaluation method.

For hierarchical clustering methods we will be using the cophenetic correlation coefficient (7). This is a measure of how well a dendrogram matches the underlying distance matrix. It is defined as the correlation between the Euclidean distance and the distance in the dendrogram (8), or in our case, between Gower’s distance metric and the distance in the dendrogram. The distance between two instances in the dendrogram is defined as the height in the dendrogram where two instances are joined for the first time. The resulting formula is as follows:

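$$
c = \frac{\sum_{i<j} \left(x(i,j) - \bar{x}\right)\left(t(i,j) - \bar{t}\right)}{\sqrt{\left(\sum_{i<j} \left(x(i,j) - \bar{x}\right)^2\right)\left(\sum_{i<j} \left(t(i,j) - \bar{t}\right)^2\right)}}
$$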

Where:

• c : the cophenetic correlation coefficient.
• x(i,j) : the Euclidean / Gower’s distance between instances i and j.
• t(i,j) : the distance between instances i and j in the dendrogram, defined as the height in the dendrogram where the two instances are joined for the first time.
• x̄ : the average Euclidean / Gower’s distance between instances.
• t̄ : the average distance between instances in the dendrogram.

The fit is deemed reasonably good if the cophenetic correlation coefficient lies between 0.7 and 0.8 on a scale from 0 to 1, good when it is in the range (0.8, 0.9], and very good for any value larger than 0.9 (9).
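The coefficient is an ordinary correlation between two lists of pairwise distances, which a short sketch makes concrete. The function below assumes that the original distances x(i,j) and the dendrogram distances t(i,j) have already been computed and are listed for all pairs i < j in the same order; the function name and the example values are hypothetical.

```python
from math import sqrt


def cophenetic_correlation(x, t):
    """Correlation between original distances x and dendrogram distances t.

    Both arguments are equally long lists holding one value per pair i < j,
    in the same pair order.
    """
    n = len(x)
    x_bar = sum(x) / n
    t_bar = sum(t) / n
    numerator = sum((a - x_bar) * (b - t_bar) for a, b in zip(x, t))
    denominator = sqrt(
        sum((a - x_bar) ** 2 for a in x) * sum((b - t_bar) ** 2 for b in t)
    )
    return numerator / denominator


# Three instances: pairs (1,2), (1,3), (2,3). Instances 1 and 2 join at
# height 1, and instance 3 joins them at height 4.5 (made-up dendrogram).
c = cophenetic_correlation([1.0, 4.0, 5.0], [1.0, 4.5, 4.5])
```

In this small example the dendrogram cannot distinguish the distances from instance 3 to instances 1 and 2, yet the coefficient remains well above 0.9, so a high value does not by itself guarantee that every pairwise distance is preserved.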

SILHOUETTE

A more visual criterion to decide whether or not a clustering is good is the (average) silhouette (10). The silhouette is a measure of how well an instance is matched to its own cluster compared to the closest other cluster. By looking at the average silhouette over all instances, we can get a good idea whether or not the current clustering is appropriate. By doing this for multiple different numbers of clusters, we can determine a good value for the number of clusters.

Before we continue, we need to define a few variables:

• ai : average dissimilarity of instance i to all other objects in its own cluster A. This variable has value 0 for a cluster of size 1.
• di,C : average dissimilarity of instance i to all objects in another cluster C.
• bi : the minimum of di,C over all clusters C ≠ A.

We will then be looking at the silhouette si of instance i: (10)

$$
s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}
$$

We can see that:

$$
-1 \le s_i \le 1
$$

Now we can get an idea about what si represents : (10)

• If si is close to 1, then ai is much lower than bi, indicating that instance i is assigned to the proper cluster.
• If si is close to -1, then ai is much higher than bi, indicating that instance i is assigned to the wrong cluster.
• If si is close to 0, then ai is approximately equal to bi, indicating that it is unclear to which cluster instance i should be allocated.

By looking at the average silhouette S we can determine whether or not instances have been properly assigned to a cluster. This can be used to determine the number of clusters by computing multiple different clusterings and their average silhouettes and plotting these in a simple graph. We can then select a clustering based on the value of the average silhouette and the number of clusters. For instance, a value of K = 2 clusters might have a high silhouette, but not enough clusters for us to actually work with.
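The definitions above translate directly into code. The sketch below assumes a precomputed, symmetric dissimilarity matrix and at least two clusters; it follows the convention stated above that ai is 0 for a cluster of size 1. The function name is hypothetical.

```python
def silhouettes(dist, labels):
    """Silhouette s_i for every instance.

    dist   -- square matrix (list of lists) of pairwise dissimilarities
    labels -- cluster label of each instance; at least two distinct labels
    """
    n = len(labels)
    clusters = set(labels)
    result = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        # a_i: average dissimilarity to the rest of the own cluster (0 if alone).
        a_i = sum(dist[i][j] for j in own) / len(own) if own else 0.0
        # b_i: smallest average dissimilarity to any other cluster.
        b_i = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters
            if c != labels[i]
        )
        largest = max(a_i, b_i)
        result.append((b_i - a_i) / largest if largest else 0.0)
    return result


# Four points on a line at 0, 1, 10 and 11 (made-up data).
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
good = silhouettes(dist, [0, 0, 1, 1])
bad = silhouettes(dist, [0, 1, 0, 1])
```

Averaging `good` gives a value close to 1, while the deliberately wrong labelling in `bad` yields negative silhouettes, matching the interpretation given above.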

METHODS

The goal of this paper is to determine the most promising application for the Spotify dataset. In order to do this, we will be following these steps:

• Exploration of the data.
• Evaluation of the usefulness of different variables in different models, as well as the models themselves.
• Choosing the most promising model based on said evaluation, with proper argumentation.
• Building a prototype of said model.

RESULTS

DATA EXPLORATION

This dataset is not the entire dataset, but just a sample from the main database of Spotify for the month of March from the Netherlands. It contains data from 969 different users. Approximately 18,000 songs can be found in the approximately 113,000 instances. The dataset contains 52 columns, corresponding to 52 different variables. The first 16 variables contain information about the user, whilst the other 36 variables describe the song. We will focus mainly on the first 16 variables, because we deemed the music-related variables to be of no use for larger models with such a small dataset. With 18,000 different songs and 113,000 instances, we have an average of roughly 6 instances per song, which is not sufficient data to construct a model with. Furthermore, only 0.36% of the songs have been listened to at least 100 times.

We will be looking at the following variables:

• Source
• Device type and OS type
• Gender
• Region
• Age

SOURCE

One of the interesting variables is the source variable. It describes how the song was found:

Source            Number of instances
Album             7731
Artist            9038
Collection        44867
Other             23628
Others_playlist   24505
Search            3248

Table 1 and Figure 1: The number of instances found through each source.

As we can see, collection is clearly the most popular way of finding a song on Spotify. Others_playlist and other also account for a sizable share of the instances. Surprisingly, the search function is the one used the least to find a song. We are not sure why this is the case; this might have something to do with how the variable is recorded, but this is just speculation.

DEVICE TYPE & OPERATING SYSTEM TYPE

Another interesting variable is the device type. As the name implies, it records on which type of device the song was listened to:

Device type   Number of instances   Percentage originating from a premium account
Desktop       56866                 51%
Mobile        42347                 78%
Tablet        13804                 51%

Table 2 and Figure 2: The number of instances on each device type.

Desktop leads as the favourite device for using Spotify with just over half of all recorded instances. Mobile devices follow in a reasonably close second position with 42,347 recorded instances. It seems that the tablet is not a very popular device for listening to Spotify, but to draw a firmer conclusion about this we would need data from a longer period of time. The large majority of instances from mobile platforms seem to come from premium users, though desktop and tablet do not perform poorly, since both have a premium usage percentage of 51%.

It is interesting to note that the smart TV is not a separate device type, though Spotify does support an application for these TVs. Either smart TVs are added to another device type, or the application came out after March 2015, or no one in the Netherlands uses this application, which seems very unlikely. There are probably more possible reasons why this device is not shown in this dataset, but since we do not have any means of confirming any of them, we will not speculate any further.

Another variable, the operating system type, is also interesting in combination with the device type:

Operating system   Number of instances
Android            19962
Browser            995
iOS                35690
Linux              243
Mac                12009
Other              10691
Windows            32928
W. Phone           499

Percentage from premium users:

Table 3: The number of instances per operating system.

As expected, the number of instances on mobile and tablet from the device type variable equals the number of instances on Android, iOS and Windows Phone (W. Phone). The same applies to desktop on the one hand and browser, Linux, Mac, other and Windows on the other.

REGION

It can also be quite interesting to see from which areas of the Netherlands the users come.

Province        Number of instances   Number of users   Average instances per user   Percentage premium users
Unknown         4050                  9                 450                          11.1%
Drenthe         1961                  66                29.7                         42.7%
Flevoland       1353                  15                90.2                         37.5%
Friesland       1373                  21                65.4                         23.3%
Gelderland      8105                  81                100.1                        36.7%
Groningen       3600                  27                133.3                        30.6%
Limburg         3343                  41                81.5                         36.0%
Noord Brabant   18989                 133               142.8                        49.3%
Noord Holland   25063                 199               125.9                        42.8%
Overijssel      8824                  56                157.6                        35.4%
Utrecht         12118                 108               112.2                        41.0%
Zeeland         1616                  14                115.4                        30.4%
Zuid Holland    22622                 188               120.3                        44.2%

Table 5: The number of instances, the number of users, the average number of instances per user, and the percentage of premium users per province.

Figure 5: The percentage of premium users per province.

Table 5 and Figure 5 show quite well how types of users and number of users differ per province. Noord Brabant, Noord Holland, Zuid Holland, and Utrecht, for instance, have a relatively high percentage of premium users, a large number of instances and high usage per user. Drenthe has a surprisingly low average number of instances per user with just 29.7, especially compared to Overijssel, where this number rests at 157.6. Drenthe, Flevoland, Friesland, and Zeeland have very low usage, with each of these provinces providing at most around 2000 instances in a database with a total of approximately 113,000 instances. It seems that Spotify still has a potential market that is spread over the Netherlands and not necessarily focussed on a few provinces. The nine users whose location is unknown account for a significant portion of the recorded instances, with an average of 450 instances per user.
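The averages quoted above follow from dividing the number of instances by the number of users per province. As a small check, the sketch below recomputes them from the instance and user counts in Table 5; the variable names are hypothetical.

```python
# (number of instances, number of users) per province, copied from Table 5.
table5 = {
    "Unknown": (4050, 9),
    "Drenthe": (1961, 66),
    "Flevoland": (1353, 15),
    "Friesland": (1373, 21),
    "Gelderland": (8105, 81),
    "Groningen": (3600, 27),
    "Limburg": (3343, 41),
    "Noord Brabant": (18989, 133),
    "Noord Holland": (25063, 199),
    "Overijssel": (8824, 56),
    "Utrecht": (12118, 108),
    "Zeeland": (1616, 14),
    "Zuid Holland": (22622, 188),
}

# Average instances per user, rounded to one decimal as in the table.
averages = {
    province: round(instances / users, 1)
    for province, (instances, users) in table5.items()
}

total_instances = sum(instances for instances, _ in table5.values())
```

The total number of instances obtained this way is the same as the sum of the counts per source and per device type reported earlier.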


AGE

As the second to last variable we will be looking more closely at the age of the users.

Time period   Number of users   Percentage of premium users

Table 6 and Figure 6: The birth year of users.

Spotify seems to have a relatively young user base. The correlation between the lower boundary of each time period and the number of users is 0.89, which indicates a strong positive correlation between the birth year at which a bracket starts and the number of users in that bracket. This in turn implies a strong relationship between age and the number of users of that age, with younger users being more numerous.

Interestingly though, the younger users do not contribute the most to the number of premium users on a percentage basis. The age group 25 – 65 contributes the most on a percentage basis, ranging from 55% to 65%, whilst the age range of 13-25 only contributes 23.5%. This could be money related, since older people tend to have more money.

PREMIUM USERS

The last variable that we will be looking at is most likely the most important one to Spotify. It tells us what type of user created the instance as well as telling us how they got a certain kind of membership:

Kind \ Type   Basic-Desktop   Free    Premium
Ad            0               42060   0
Paid          1945            0       38570
Partner       0               0       27445
Trial         0               33      2964

Table 7: The number of instances created by each type / kind pair. Type represents the type of user: free, premium or basic-desktop. Kind represents how the user got this status: through ads on their Spotify, by paying for it, through a partner, or through a trial version.

WHEN DO MOST PEOPLE LISTEN TO SPOTIFY?

This is a relatively easy question to answer, since we have the stream time available in our dataset:

Figure 7: A histogram of when instances were streamed.

It is quite obvious that people tend to listen less to music on Spotify during the night. The number of instances starts to increase around 5:00 – 6:00 in the morning, growing steadily until around 9:00. After that the number of instances increases, though at a slower rate, until 15:00, when it slowly starts to decline at a reasonably steady pace until midnight. This information could be useful when planning server management or for planning the maximum capacity throughout the day. However, when we start looking at when a specific playlist is listened to, things can look quite different.
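A histogram like this amounts to counting instances per hour of the day. The sketch below assumes that the stream time is available as an ISO-formatted timestamp string, which is an assumption about the data format; the function name and the sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def instances_per_hour(stream_times):
    """Return a list of 24 counts, one for each hour of the day."""
    counts = Counter(datetime.fromisoformat(t).hour for t in stream_times)
    return [counts.get(hour, 0) for hour in range(24)]


# Made-up timestamps, only to show the expected input shape.
sample = ["2015-03-01 08:15:00", "2015-03-01 08:45:10", "2015-03-02 23:05:00"]
histogram = instances_per_hour(sample)
```

Restricting the input to the instances of a single playlist gives the per-playlist view with the same function.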

WHEN DO PEOPLE LISTEN TO A SPECIFIC PLAYLIST?

When looking at a histogram of the average time at which a playlist is listened to, we find a pattern similar to the one we found when looking at when songs were streamed:

Figure 8 a (left) and b (right). Left: A histogram of the average time of day when playlists were listened to. Right: A histogram of one of the playlists, created by a user with id 11155891456.

It is clear that the histogram for the average time of day per playlist resembles the histogram for when instances were streamed. However, when we construct histograms for individual playlists, we can see clear differences between playlists. One example is shown in figure 8b, where we can clearly see that this playlist is listened to only between 20:00 and 10:00. There are also playlists that people listen to at every point of the day, those that are only listened to during working hours, or those that are listened to significantly more during 20:00 – 24:00, but not as much during the rest of the day. This shows that we cannot assume that all playlists are similar in nature, which probably is not a significant shock to anyone. However, this could perhaps be used to advertise different playlists to people at different times of the day.

MODEL SELECTION

We will start by looking at the recommender system. A recommender system in this situation would try to predict which playlists / songs would be liked by a certain user based on the history of that user and the history of other users. Based on what other users with a similar history listened to, it will recommend playlists or songs to the user. Such an analysis requires a vast amount of data, especially when the number of possible playlists / songs can be very large. Since Spotify managed to generate just over 100,000 instances just in the Netherlands with a subset of its users, we believe that, using the entire dataset of Spotify, a recommender system would be possible, though it would require a significant amount of computing power. However, since we only have a small subset of the data available, we do not have enough data to create such a recommender system. As mentioned before, we have 18,000 songs in this dataset, which means each song is listened to approximately 6 times on average. Combined with the fact that just 0.36% of the songs have been listened to at least 100 times, it is clear that no accurate recommender system can be built. Furthermore, Spotify already has a recommender system, so the added value of another system would be small and probably not very useful to . In our opinion, the recommender system is not a very viable idea for this subset, but it could be an interesting tool if it can be run on the entire dataset.

As for the clustering of users, we would try to find patterns in the dataset that we could exploit to identify different groups of users. This might then be used to target different groups of users in different ways and might help with sales. We have almost 1000 users available for clustering, which is a reasonable number. Furthermore, we have 16 descriptive variables for each of these users, such as age, gender, and membership type, giving us a reasonable number of variables to perform the clustering on. A tricky part of this would be that it would be unsupervised learning without any previous information on what type of clusters / people could be expected to show up, so deciding what clustering is a good clustering might become difficult. However, if we do manage to find criteria to determine a good fit for a clustering through some alternative means, we might discover some interesting consumer groups in the Spotify database. To be able to work properly with users instead of instances, we would have to transform the database from an instance-focussed database to a user-focussed database, though this should not be too difficult. In our opinion, the clustering of users is a viable model for this subset of the data and possibly an even more potent model if used on the entire dataset.
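One way to turn an instance-focussed table into a user-focussed one is to collapse all instances of a user into a single row. The sketch below takes the most frequent value for nominal variables and the mean for continuous variables; this aggregation rule is an assumption made for illustration, as are the function name and the column names in the example.

```python
from collections import Counter, defaultdict
from statistics import mean


def to_user_rows(instances, user_key, nominal, continuous):
    """Collapse instance rows (dicts) into one summary row per user."""
    by_user = defaultdict(list)
    for row in instances:
        by_user[row[user_key]].append(row)

    users = {}
    for user_id, rows in by_user.items():
        summary = {}
        for column in nominal:
            # Most frequent value among this user's instances.
            summary[column] = Counter(r[column] for r in rows).most_common(1)[0][0]
        for column in continuous:
            # Average value over this user's instances.
            summary[column] = mean(r[column] for r in rows)
        users[user_id] = summary
    return users


# Made-up instance rows with hypothetical column names.
rows = [
    {"user_id": 1, "device_type": "mobile", "stream_length": 100},
    {"user_id": 1, "device_type": "mobile", "stream_length": 200},
    {"user_id": 1, "device_type": "desktop", "stream_length": 300},
    {"user_id": 2, "device_type": "tablet", "stream_length": 50},
]
user_rows = to_user_rows(rows, "user_id", ["device_type"], ["stream_length"])
```

Taking only the most frequent value discards how varied a user's behaviour is, so other summaries, such as the share of instances per device type, may be worth considering.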

Our choice for a model is quite obvious: we prefer to try the clustering of users above the recommender system, at least for this dataset.

CLUSTERING

When we want to perform a clustering, we first need to determine a few key factors such as the clustering method that we use and, in our case, how to handle the different types of variables. In our database we have nominal and continuous variables that we would like to use for the clustering. However, most clustering algorithms can only work with either nominal or continuous variables, unless some transformation is used on the database. We will discuss the choices that we made regarding these factors after choosing the variables used for the clustering.

SELECTION OF VARIABLES

We will be using the following variables for our clustering:

Name                Type         Description
Hour of streaming   Nominal      Time of streaming
Stream length       Continuous   Duration of stream
Source              Nominal      Denotes how the song was found, for instance through artist or collection.
Source uri          Nominal      If source was a playlist, then this denotes the specific playlist.
Device type         Nominal      Desktop, mobile, or tablet. No smart TV.
Operating system    Nominal      The operating system of the device.
Region              Nominal      The province where the song was listened to.
Gender              Nominal      Gender
Birth year          Continuous   Birth year
Access              Nominal      Type of account. Basic desktop, free or premium.
Type                Nominal      Ad, paid, partner, trial. Gives extra information about access.

Table 10: The chosen variables.

As discussed before, the music specific variables will not be used for clustering due to the many different songs present in the database. This leaves us with 16 possible variables for clustering. We removed five more variables:

• The user id. If we want to cluster different users, then this will be of no use.
• The metadata id. We do not know what this is for, so we will not be using it.
• The date of streaming. We do not use this for the same reason as the user id.
• Cached. This variable is always false, so it does not give us any information.
• Stream territory. For us this is always NL, for the Netherlands.

This leaves us with the above-mentioned set of eleven variables.

HIERARCHAL OR NON-HIERARCHAL

In order to be able to determine which clustering is the best one for our problem, we need to be able to rate different clusterings. In supervised learning cases this would not be a problem, since one could do some basic tests to determine how well a clustering fits the data. We, on the other hand, have an unsupervised case, since we do not know in advance to which cluster an individual belongs. Since the data has twelve variables, using a simple visual test would be difficult