R scripts - BTYD walkthrough, Papers of Introduction to Machine Learning

R scripts that walkthrough BTYD package

Typology: Papers

2019/2020

Uploaded on 04/22/2020

Camry_16
Camry_16 🇺🇸

1 document

1 / 35

Toggle sidebar

This page cannot be seen from the preview

Don't miss anything!

bg1
Buy ’Til You Die - A Walkthrough
Daniel McCarthy, Edward Wadsworth
November, 2014
1 Introduction
The BTYD package contains models to capture non-contractual purchasing
behavior of customers—or, more simply, models that tell the story of people
buying until they die (become inactinve as customers). The main models
presented in the package are the Pareto/NBD, BG/NBD and BG/BB models,
which describe scenario of the firm not being able to observe the exact time at
which a customer drops out. We will cover each in turn. If you are unfamiliar
with these models, Fader et al. (2004) provides a description of the BG/NBD
model, Fader et al. (2005) provides a description of the Pareto/NBD model and
Fader et al. (2010) provides a description of the BG/BB model.
2 Pareto/NBD
The Pareto/NBD model is used for non-contractual situations in which customers
can make purchases at any time. Using four parameters, it describes the rate
at which customers make purchases and the rate at which they drop out—
allowing for heterogeneity in both regards, of course. We will walk through the
Pareto/NBD functionality provided by the BTYD package using the CDNOW
1
dataset. As shown by figure 1, the Pareto/NBD model describes this dataset
quite well.
2.1 Data Preparation
The data required to estimate Pareto/NBD model parameters is surprisingly
little. The customer-by-customer approach of the model is retained, but we need
only three pieces of information for every person: how many transactions they
made in the calibration period (frequency), the time of their last transaction
(recency), and the total time for which they were observed. A customer-by-
sufficient-statistic matrix, as used by the BTYD package, is simply a matrix
with a row for every customer and a column for each of the above-mentioned
statistics.
1
Provided with the BTYD package and available at brucehardie.com. For more details, see
the documentation of the cdnowSummary data included in the package.
1
pf3
pf4
pf5
pf8
pf9
pfa
pfd
pfe
pff
pf12
pf13
pf14
pf15
pf16
pf17
pf18
pf19
pf1a
pf1b
pf1c
pf1d
pf1e
pf1f
pf20
pf21
pf22
pf23

Partial preview of the text

Download R scripts - BTYD walkthrough and more Papers Introduction to Machine Learning in PDF only on Docsity!

Buy ’Til You Die - A Walkthrough

Daniel McCarthy, Edward Wadsworth

November, 2014

1 Introduction

The BTYD package contains models to capture non-contractual purchasing behavior of customers—or, more simply, models that tell the story of people buying until they die (become inactinve as customers). The main models presented in the package are the Pareto/NBD, BG/NBD and BG/BB models, which describe scenario of the firm not being able to observe the exact time at which a customer drops out. We will cover each in turn. If you are unfamiliar with these models, Fader et al. (2004) provides a description of the BG/NBD model, Fader et al. (2005) provides a description of the Pareto/NBD model and Fader et al. (2010) provides a description of the BG/BB model.

2 Pareto/NBD

The Pareto/NBD model is used for non-contractual situations in which customers can make purchases at any time. Using four parameters, it describes the rate at which customers make purchases and the rate at which they drop out— allowing for heterogeneity in both regards, of course. We will walk through the Pareto/NBD functionality provided by the BTYD package using the CDNOW^1 dataset. As shown by figure 1, the Pareto/NBD model describes this dataset quite well.

2.1 Data Preparation

The data required to estimate Pareto/NBD model parameters is surprisingly little. The customer-by-customer approach of the model is retained, but we need only three pieces of information for every person: how many transactions they made in the calibration period (frequency), the time of their last transaction (recency), and the total time for which they were observed. A customer-by- sufficient-statistic matrix, as used by the BTYD package, is simply a matrix with a row for every customer and a column for each of the above-mentioned statistics. (^1) Provided with the BTYD package and available at brucehardie.com. For more details, see the documentation of the cdnowSummary data included in the package.

0 1 2 3 4 5 6 7+

Frequency of Repeat Transactions

Calibration period transactions

Customers

0

500

1000

1500 ActualModel

Figure 1: Calibration period fit of Pareto/NBD model to CDNOW dataset.

You may find yourself with the data available as an event log. This is a data structure which contains a row for every transaction, with a customer identifier, date of purchase, and (optionally) the amount of the transaction. dc.ReadLines is a function to convert an event log in a comma-delimited file to an data frame in R—you could use another function such as read.csv or read.table if desired, but dc.ReadLines simplifies things by only reading the data you require and giving the output appropriate column names. In the example below, we create an event log from the file “cdnowElog.csv”, which has customer IDs in the second column, dates in the third column and sales numbers in the fifth column.

cdnowElog <- system.file("data/cdnowElog.csv", package = "BTYD") elog <- dc.ReadLines(cdnowElog, cust.idx = 2 , date.idx = 3 , sales.idx = 5 ) elog[ 1 : 3 ,]

cust date sales

1 1 19970101 29.

2 1 19970118 29.

3 1 19970802 14.

Note the formatting of the dates in the output above. dc.Readlines saves dates as characters, exactly as they appeared in the original comma-delimited file. For many of the data-conversion functions in BTYD, however, dates need to be compared to each other—and unless your years/months/days happen to be in the right order, you probably want them to be sorted chronologically and

split.data <- dc.SplitUpElogForRepeatTrans(elog.cal); clean.elog <- split.data$repeat.trans.elog;

The next step is to create a customer-by-time matrix. This is simply a matrix with a row for each customer and a column for each date. There are several different options for creating these matrices:

  • Frequency—each matrix entry will contain the number of transactions made by that customer on that day. Use dc.CreateFreqCBT. If you have already used dc.MergeTransactionsOnSameDate, this will simply be a reach customer-by-time matrix.
  • Reach—each matrix entry will contain a 1 if the customer made any transactions on that day, and 0 otherwise. Use dc.CreateReachCBT.
  • Spend—each matrix entry will contain the amount spent by that customer on that day. Use dc.CreateSpendCBT. You can set whether to use to- tal spend for each day or average spend for each day by changing the is.avg.spend parameter. In most cases, leaving is.avg.spend as FALSE is appropriate.

freq.cbt <- dc.CreateFreqCBT(clean.elog); freq.cbt[ 1 : 3 , 1 : 5 ]

date

cust 1997-01-08 1997-01-09 1997-01-10 1997-01-11 1997-01-

1 0 0 0 0 0

2 0 0 0 0 0

6 0 0 0 1 0

There are two things to note from the output above:

  1. Customers 3, 4, and 5 appear to be missing, and in fact they are. They did not make any repeat transactions and were removed when we used dc.SplitUpElogForRepeatTrans. Do not worry though—we will get them back into the data when we use dc.MergeCustomers (soon).
  2. The columns start on January 8—this is because we removed all the first transactions (and nobody made 2 transactions withing the first week).

We have a small problem now—since we have deleted all the first transactions, the frequency customer-by-time matrix does not have any of the customers who made zero repeat transactions. These customers are still important; in fact, in most datasets, more customers make zero repeat transactions than any other number. Solving the problem is reasonably simple: we create a customer-by-time matrix using all transactions, and then merge the filtered CBT with this total CBT (using data from the filtered CBT and customer IDs from the total CBT).

tot.cbt <- dc.CreateFreqCBT(elog) cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt)

From the calibration period customer-by-time matrix (and a bit of addi- tional information we saved earlier), we can finally create the customer-by- sufficient-statistic matrix described earlier. The function we are going to use is dc.BuildCBSFromCBTAndDates, which requires a customer-by-time matrix, starting and ending dates for each customer, and the time of the end of the calibration period. It also requires a time period to use—in this case, we are choosing to use weeks. If we chose days instead, for example, the recency and total time observed columns in the customer-by-sufficient-statistic matrix would have gone up to 273 instead of 39, but it would be the same data in the end. This function could also be used for the holdout period by setting cbt.is.during.cal.period to FALSE. There are slight differences when this function is used for the holdout period—it requires different input dates (simply the start and end of the holdout period) and does not return a recency (which has little value in the holdout period).

birth.periods <- split.data$cust.data$birth.per last.dates <- split.data$cust.data$last.date cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period) cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates, per="week")

You’ll be glad to hear that, for the process described above, the package contains a single function to do everything for you: dc.ElogToCbsCbt. However, it is good to be aware of the process above, as you might want to make small changes for different situations—for example, you may not want to remove customers’ initial transactions, or your data may be available as a customer-by- time matrix and not as an event log. For most standard situations, however, dc.ElogToCbsCbt will do. Reading about its parameters and output in the package documentation will help you to understand the function well enough to use it for most purposes.

2.2 Parameter Estimation

Now that we have the data in the correct format, we can estimate model parame- ters. To estimate parameters, we use pnbd.EstimateParameters, which requires a calibration period customer-by-sufficient-statistic matrix and (optionally) start- ing parameters. (1,1,1,1) is used as default starting parameters if none are provided. The function which is maximized is pnbd.cbs.LL, which returns the log-likelihood of a given set of parameters for a customer-by-sufficient-statistic matrix.

0.00 0.05 0.10 0.15 0.20 0.25 0.

0

5

10

15

20

25

Heterogeneity in Transaction Rate

Transaction Rate

Density

Mean: 0.0523 Var: 0.

Figure 2: Transaction rate heterogeneity of estimated parameters.

0.00 0.05 0.10 0.15 0.20 0.25 0.

0

5

10

15

20

25

Heterogeneity in Dropout Rate

Dropout rate

Density

Mean: 0.052 Var: 0.

Figure 3: Dropout rate heterogeneity of estimated parameters.

2.3 Individual Level Estimations

Now that we have parameters for the population, we can make estimations for customers on the individual level. First, we can estimate the number of transactions we expect a newly acquired customer to make in a given time period. Let’s say, for example, that we are interested in the number of repeat transactions a newly acquired customer will make in a time period of one year. Note that we use 52 weeks to represent one year, not 12 months, 365 days, or 1 year. This is because our parameters were estimated using weekly data.

pnbd.Expectation(params, t= 52 );

[1] 1.

We can also obtain expected characteristics for a specific customer, condi- tional on their purchasing behavior during the calibration period. The first of these is pnbd.ConditionalExpectedTransactions, which gives the number of transactions we expect a customer to make in the holdout period. The second is pnbd.PAlive, which gives the probability that a customer is still alive at the end of the calibration period. As above, the time periods used depend on which time period was used to estimate the parameters.

cal.cbs["1516",]

x t.x T.cal

26.00 30.86 31.

x <- cal.cbs["1516", "x"] t.x <- cal.cbs["1516", "t.x"] T.cal <- cal.cbs["1516", "T.cal"] pnbd.ConditionalExpectedTransactions(params, T.star = 52 , x, t.x, T.cal)

[1] 25.

pnbd.PAlive(params, x, t.x, T.cal)

[1] 0.

There is one more point to note here—using the conditional expectation function, we can see the “increasing frequency paradox” in action:

for (i in seq( 10 , 25 , 5 )){ cond.expectation <- pnbd.ConditionalExpectedTransactions( params, T.star = 52 , x = i, t.x = 20 , T.cal = 39 )

tot.cust.trans <- length(which(elog.custs == current.cust)) cal.trans <- cal.cbs[i, "x"] cal.cbs[i, "x.star"] <- tot.cust.trans - cal.trans } cal.cbs[ 1 : 3 ,]

x t.x T.cal x.star

1 2 30.429 38.86 1

2 1 1.714 38.86 0

3 0 0.000 38.86 0

Now we can see how well our model does in the holdout period. Figure 4 shows the output produced by the code below. It divides the customers up into bins according to calibration period frequencies and plots actual and conditional expected holdout period frequencies for these bins.

T.star <- 39 # length of the holdout period censor <- 7 # This censor serves the same purpose described above x.star <- cal.cbs[,"x.star"] comp <- pnbd.PlotFreqVsConditionalExpectedFrequency(params, T.star, cal.cbs, x.star, censor)

rownames(comp) <- c("act", "exp", "bin") comp

freq.0 freq.1 freq.2 freq.3 freq.4 freq.5 freq.6 freq.7+

act 0.2367 0.6970 1.393 1.560 2.532 2.947 3.862 6.

exp 0.1385 0.5996 1.196 1.714 2.399 2.907 3.819 6.

bin 1411.0000 439.0000 214.000 100.000 62.000 38.000 29.000 64.

As you can see above, the graph also produces a matrix output. Most plotting functions in the BTYD package produce output like this. They are often worth looking at because they contain additional information not presented in the graph—the size of each bin in the graph. In this graph, for example, this information is important because the bin sizes show that the gap at zero means a lot more than the precision at 6 or 7 transactions. Despite this, this graph shows that the model fits the data very well in the holdout period. Aggregation by calibration period frequency is just one way to do it. BTYD also provides plotting functions which aggregate by several other measures. The other one I will demonstrate here is aggregation by time—how well does our model predict how many transactions will occur in each week? The first step, once again, is going to be to collect the data we need to compare the model to. The customer-by-time matrix has already collected the data for us by time period; so we’ll use that to gather the total transactions per day. Then we convert the daily tracking data to weekly data.

0

2

4

6

8

Conditional Expectation

Calibration period transactions

Holdout period transactions

0 1 2 3 4 5 6 7+

Actual Model

Figure 4: Actual vs. conditional expected transactions in the holdout period.

tot.cbt <- dc.CreateFreqCBT(elog)

...Completed Freq CBT

d.track.data <- rep( 0 , 7 * 78 ) origin <- as.Date("1997-01-01") for (i in colnames(tot.cbt)){ date.index <- difftime(as.Date(i), origin) + 1 ; d.track.data[date.index] <- sum(tot.cbt[,i]); } w.track.data <- rep( 0 , 78 ) for (j in 1 : 78 ){ w.track.data[j] <- sum(d.track.data[(j* 7 - 6 ):(j* 7 )]) }

Now, we can make a plot comparing the actual number of transactions to the expected number of transactions on a weekly basis, as shown in figure

  1. Note that we set n.periods.final to 78. This is to show that we are workly with weekly data. If our tracking data was daily, we would use 546 here—the function would plot our daily tracking data against expected daily transactions, instead of plotting our weekly tracking data against expected weekly transactions. This concept may be a bit tricky, but is explained in the documentation for pnbd.PlotTrackingInc. The reason there are two numbers for the total period (T.tot and n.periods.final) is that your customer-by-

0

1000

2000

3000

4000

Tracking Cumulative Transactions

78

Cumulative Transactions

1 5 9 14 19 24 29 34 39 44 49 54 59 64 69 74

Actual Model

Figure 6: Actual vs. expected cumulative purchasing behaviour.

cum.tracking.data <- cumsum(w.track.data) cum.tracking <- pnbd.PlotTrackingCum(params, T.cal, T.tot, cum.tracking.data, n.periods.final)

cum.tracking[, 20 : 25 ]

[,1] [,2] [,3] [,4] [,5] [,6]

actual 1359 1414 1484 1517 1573 1672

expected 1309 1385 1460 1533 1604 1674

3 BG/NBD

The BG/NBD model, like the Pareto/NBD model, is used for non-contractual situations in which customers can make purchases at any time. It describes the rate at which customers make purchases and the rate at which they drop out with four parameters—allowing for heterogeneity in both. We will walk through the BG/NBD functions provided by the BTYD package using the CDNOW^2 dataset. As shown by figure 7, the BG/NBD model describes this dataset quite well. (^2) Provided with the BTYD package and available at brucehardie.com. For more details, see the documentation of the cdnowSummary data included in the package.

0 1 2 3 4 5 6 7+

Frequency of Repeat Transactions

Calibration period transactions

Customers

0

200

400

600

800

1000

1200

1400

ActualModel

Figure 7: Calibration period fit of BG/NBD model to CDNOW dataset.

3.1 Data Preparation

The data required to estimate BG/NBD model parameters is surprisingly little. The customer-by-customer approach of the model is retained, but we need only three pieces of information for every person: how many transactions they made in the calibration period (frequency), the time of their last transaction (recency), and the total time for which they were observed. This is the same as what is needed for the Pareto/NBD model. Indeed, if you have read the data preparation section for the Pareto/NBD model, you can safely skip over this section and move to the section on Parameter Estimation. A customer-by-sufficient-statistic matrix, as used by the BTYD package, is simply a matrix with a row for every customer and a column for each of the above-mentioned statistics. You may find yourself with the data available as an event log. This is a data structure which contains a row for every transaction, with a customer identifier, date of purchase, and (optionally) the amount of the transaction. dc.ReadLines is a function to convert an event log in a comma-delimited file to an data frame in R—you could use another function such as read.csv or read.table if desired, but dc.ReadLines simplifies things by only reading the data you require and giving the output appropriate column names. In the example below, we create an event log from the file “cdnowElog.csv”, which has customer IDs in the second column, dates in the third column and sales numbers in the fifth column.

end.of.cal.period <- as.Date("1997-09-30") elog.cal <- elog[which(elog$date <= end.of.cal.period), ]

The final cleanup step is a very important one. In the calibration period, the BG/NBD model is generally concerned with repeat transactions—that is, the first transaction is ignored. This is convenient for firms using the model in practice, since it is easier to keep track of all customers who have made at least one transaction (as opposed to trying to account for those who have not made any transactions at all). The one problem with simply getting rid of customers’ first transactions is the following: We have to keep track of a “time zero” as a point of reference for recency and total time observed. For this reason, we use dc.SplitUpElogForRepeatTrans, which returns a filtered event log ($repeat.trans.elog) as well as saving important information about each customer ($cust.data).

split.data <- dc.SplitUpElogForRepeatTrans(elog.cal); clean.elog <- split.data$repeat.trans.elog;

The next step is to create a customer-by-time matrix. This is simply a matrix with a row for each customer and a column for each date. There are several different options for creating these matrices:

  • Frequency—each matrix entry will contain the number of transactions made by that customer on that day. Use dc.CreateFreqCBT. If you have already used dc.MergeTransactionsOnSameDate, this will simply be a reach customer-by-time matrix.
  • Reach—each matrix entry will contain a 1 if the customer made any transactions on that day, and 0 otherwise. Use dc.CreateReachCBT.
  • Spend—each matrix entry will contain the amount spent by that customer on that day. Use dc.CreateSpendCBT. You can set whether to use to- tal spend for each day or average spend for each day by changing the is.avg.spend parameter. In most cases, leaving is.avg.spend as FALSE is appropriate.

freq.cbt <- dc.CreateFreqCBT(clean.elog); freq.cbt[ 1 : 3 , 1 : 5 ]

date

cust 1997-01-08 1997-01-09 1997-01-10 1997-01-11 1997-01-

1 0 0 0 0 0

2 0 0 0 0 0

6 0 0 0 1 0

There are two things to note from the output above:

  1. Customers 3, 4, and 5 appear to be missing, and in fact they are. They did not make any repeat transactions and were removed when we used dc.SplitUpElogForRepeatTrans. Do not worry though—we will get them back into the data when we use dc.MergeCustomers (soon).
  2. The columns start on January 8—this is because we removed all the first transactions (and nobody made 2 transactions withing the first week).

We have a small problem now—since we have deleted all the first transactions, the frequency customer-by-time matrix does not have any of the customers who made zero repeat transactions. These customers are still important; in fact, in most datasets, more customers make zero repeat transactions than any other number. Solving the problem is reasonably simple: we create a customer-by-time matrix using all transactions, and then merge the filtered CBT with this total CBT (using data from the filtered CBT and customer IDs from the total CBT).

tot.cbt <- dc.CreateFreqCBT(elog) cal.cbt <- dc.MergeCustomers(tot.cbt, freq.cbt)

From the calibration period customer-by-time matrix (and a bit of addi- tional information we saved earlier), we can finally create the customer-by- sufficient-statistic matrix described earlier. The function we are going to use is dc.BuildCBSFromCBTAndDates, which requires a customer-by-time matrix, starting and ending dates for each customer, and the time of the end of the calibration period. It also requires a time period to use—in this case, we are choosing to use weeks. If we chose days instead, for example, the recency and total time observed columns in the customer-by-sufficient-statistic matrix would have gone up to 273 instead of 39, but it would be the same data in the end. This function could also be used for the holdout period by setting cbt.is.during.cal.period to FALSE. There are slight differences when this function is used for the holdout period—it requires different input dates (simply the start and end of the holdout period) and does not return a recency (which has little value in the holdout period).

birth.periods <- split.data$cust.data$birth.per last.dates <- split.data$cust.data$last.date cal.cbs.dates <- data.frame(birth.periods, last.dates, end.of.cal.period) cal.cbs <- dc.BuildCBSFromCBTAndDates(cal.cbt, cal.cbs.dates, per="week")

You’ll be glad to hear that, for the process described above, the package contains a single function to do everything for you: dc.ElogToCbsCbt. However, it is good to be aware of the process above, as you might want to make small changes for different situations—for example, you may not want to remove customers’ initial transactions, or your data may be available as a customer-by- time matrix and not as an event log. For most standard situations, however,

0.0 0.1 0.2 0.3 0.4 0.

0

5

10

15

20

Heterogeneity in Transaction Rate

Transaction Rate

Density

Mean: 0.055 Var: 0.

Figure 8: Transaction rate heterogeneity of estimated parameters.

Now that we have the parameters, the BTYD package provides functions to interpret them. As we know, r and alpha describe the gamma mixing distribution of the NBD transaction process. We can see this gamma distribution in figure 8, plotted using bgnbd.PlotTransactionRateHeterogeneity(params). We also know that a and b describe the beta mixing distribution of the beta geometric dropout probabilities. We can see this beta distribution in figure 9, plotted using bgnbd.PlotDropoutHeterogeneity(params). From these, we can already tell something about our customers: they are more likely to have low values for their individual poisson transaction process parameters, and more likely to have low values for their individual geometric dropout probability parameters.

3.3 Individual Level Estimations

Now that we have parameters for the population, we can make estimations for customers on the individual level. First, we can estimate the number of transactions we expect a newly acquired customer to make in a given time period. Let’s say, for example, that we are interested in the number of repeat transactions a newly acquired customer will make in a time period of one year. Note that we use 52 weeks to represent one year, not 12 months, 365 days, or 1 year. This is because our parameters were estimated using weekly data.

0.0 0.2 0.4 0.6 0.

0

1

2

3

4

Heterogeneity in Dropout Probability

Dropout Probability p

Density

Mean: 0.2463 Var: 0.

Figure 9: Dropout probability heterogeneity of estimated parameters.

bgnbd.Expectation(params, t= 52 );

[1] 1.

We can also obtain expected characteristics for a specific customer, condi- tional on their purchasing behavior during the calibration period. The first of these is bgnbd.ConditionalExpectedTransactions, which gives the number of transactions we expect a customer to make in the holdout period. The second is bgnbd.PAlive, which gives the probability that a customer is still alive at the end of the calibration period. As above, the time periods used depend on which time period was used to estimate the parameters.

cal.cbs["1516",]

x t.x T.cal

26.00 30.86 31.

x <- cal.cbs["1516", "x"] t.x <- cal.cbs["1516", "t.x"] T.cal <- cal.cbs["1516", "T.cal"] bgnbd.ConditionalExpectedTransactions(params, T.star = 52 , x, t.x, T.cal)

[1] 25.