This article was published as a part of the Data Science Blogathon.
As you know from my previous posts, I’ve started a mailing list to promote my blog. In the last post in this series, we designed a way to answer the question of what’s causing folks to subscribe to my blog by matching an opt-in with a source. If you haven’t checked that out already, you can find that here. This time we take a second look at our subscriber history data and question its validity. For what we’ll be doing today you’ll need to step into Sherlock Homes’ shoes, so time to put on your detective caps and solve a mystery!
So like any good analyst, I was validating the subscriber_feed_history table we built together here which gives me the history of who subscribed to my mailing list. That’s when I noticed some inconsistencies with the table, especially when compared to the actual truth. If you remember the source of this table is the mailing_list feed which is a complete list of users that are opted-in to my mailing list on a particular day.
Well turns out this subscriber_feed is not as reliable as I thought. When I use the subscriber_feed_history table to see how many users are opted-in on a particular day and match that with the email_sent_history (a table derived from logs of email sends), they don’t align 100%. This is odd as you’d assume only users that are opted-in on a particular day will be sent an email on that day but turns out that’s not the case. Welcome to the world of data!
So here are the two tables we’re working with today:
date | user_id | status |
2/2 | 456 | O |
1/31 | 123 | O |
1/30 | 777 | I |
1/28 | 123 | I |
1/27 | 456 | I |
1/26 | 789 | I |
1/25 | 606 | I |
… | … | … |
date | user_id | delivery_status |
2/1 | 777 | Y |
2/1 | 123 | Y |
2/1 | 789 | Y |
2/1 | 808 | Y |
… | … | … |
A small caveat here is that because I don’t send emails daily, the email_sent_history table won’t have all the dates but just the ones when an email was sent.
STOP! Before reading ahead here’s where I want you to take some time and think about the problem yourself before reading my approach.
Now let’s think about how we can approach this problem in steps. First, we need to determine all the users who were subscribed on the day an email was sent since an email is not being sent daily. Once we have this we can compare this with the email sent list for the day to determine two things: (1) Emails that were sent an email but don’t look to be opted in as of that day, (2) Emails that were opted-in on that day but weren’t sent an email. Finally, once we have this figured out, the final step would be to alter the subscriber_feed_history table to accurately account for these inconsistencies.
Now to get the emails that were opted-in on a particular day, we can’t leverage subscriber_feed_history directly, we need an expiry date to check the duration a user was opted in for. For this purpose, we built ourselves the subscriber_feed_history_view which adds an expiry date to the table. See here for how to do this. For the above users, the subscriber_feed_history_view looks something like this:
start_date | end_date | user_id | mailing_list_status |
1/30 | 12/31/9999 | 777 | I |
1/28 | 1/31 | 123 | I |
1/27 | 2/2 | 456 | I |
1/26 | 12/31/9999 | 789 | I |
1/25 | 12/31/9999 | 606 | I |
Now that we have this figured out, the next step is to compare the email_sent_history with the subscriber_feed_history_view to see what’s going on. The way we do that is by doing a FULL JOIN between the users that were opted-in and were sent an email to get both sides of the coin. Remember you’re JOINing to a user’s mailing_list_status at the time the emails were sent.
Here’s how you do that in SQL:
-- email sent history WITH email_sent_history AS ( SELECT date AS sent_date, user_id FROM `email_sent_history` WHERE delivery_status = "Y" ), -- users that were subscribed & the duration opt_in_subscribers AS ( SELECT start_date, end_date, user_id FROM `subscriber_feed_history_view` WHERE mailing_list_status = "I" ) -- FULL JOIN: to not miss info from any side, -- email sent & status joined to get users who -- were subscribed at the time an email was sent SELECT sent_date, opt_in_subscribers.user_id AS opt_in_users, email_sent_history.user_id AS email_sent_users FROM opt_in_subscribers FULL OUTER JOIN email_sent_history ON opt_in_subscribers.user_id = email_sent_history.user_id AND sent_date BETWEEN start_date AND end_date
And here’s what this gets us:
sent_date | opt_in_users | email_sent_users | scenario |
2/1 | 777 | 777 | |
2/1 | null | 123 | Out & Sent |
2/1 | 456 | null | In & Not sent |
2/1 | 789 | 789 | |
2/1 | null | 808 | Never In & Sent |
2/1 | 606 | null | Never Out & Not sent |
From here we see users that have a Null value in the opt_in_users field were sent an email but not opted-in and vice-versa for Null in email_sent_users. Now that we’ve figured out a way to get to the inconsistencies, the next step is to think about what do we do with them? And how do we fix the subscriber_feed_history?
Let’s start with thinking about users which were sent an email but aren’t opted-in. Well the logical thing is to opt them in, right? But when? Like on what date? There are two scenarios here: (1) Invalid opt-out: user=123 has an unsubscribe event recorded on 1/31 before the email sent on 2/1, and (2) Missed opt-in: user=808 which from the looks of it seems to have never opted-in but was sent an email. Now for the first case, you may consider the opt-out captured as invalid and delete it but you’ve to be sure here. Or implement a fix that works in both cases which is to opt-in these emails on the day they were sent an email to fix our subscriber_feed_history. This would entail adding an opt-in record for these users to the subscriber_feed_history and here’s how you do that:
INSERT INTO `subscriber_feed_history` WITH email_sent_history AS ( SELECT date AS sent_date, user_id FROM `email_sent_history` WHERE delivery_status = "Y" ), opt_in_subscribers AS ( SELECT start_date, end_date, user_id FROM `subscriber_feed_history_view` WHERE mailing_list_status = "I" ), -- LEFT JOIN with WHERE to only get users that were -- sent an email but not subscribed sent_but_not_in AS ( SELECT sent_date, user_id FROM email_sent_history LEFT JOIN opt_in_subscribers ON opt_in_subscribers.user_id = email_sent_history.user_id AND sent_date BETWEEN start_date AND end_date WHERE opt_in_subscribers.user_id IS NULL ) -- INSERT an Opt-In record in history table for these SELECT sent_date AS date, user_id, "I" AS status FROM sent_but_not_in
On the other hand, users that appear to be opted-in but weren’t sent an email, must’ve opted out, right? (We make an assumption here which we’ll revisit in the future). Let’s think about the two scenarios here: (1) Invalid opt-out: user_id=456 seems to have unsubscribed on 2/2 but wasn’t a part of the 2/1 send. (2) Missed opt-out: user_id=606 never seems to have unsubscribed but wasn’t a part of the 2/1 email. The real question is when did they unsubscribe? Given the data we have, I’d say our best estimate would be the last email sent date. For example, if user_id=606 was a part of the emails before 2/1, let’s say 1/25, it must’ve unsubscribed between these two dates, and for the lack of better knowledge we can assume it did on 1/26, a day after the last send. So to apply this fix to subscriber_feed_history we add an opt-out event with a new date as follows:
INSERT INTO `subscriber_feed_history` WITH email_sent_history AS ( SELECT date AS sent_date, user_id FROM `email_sent_history` WHERE delivery_status = "Y" ), opt_in_subscribers AS ( SELECT start_date, end_date, user_id FROM `subscriber_feed_history_view` WHERE mailing_list_status = "I" ), -- LEFT JOIN with WHERE to only get users that were -- subscribed but not sent an email in_but_not_sent AS ( SELECT user_id FROM opt_in_subscribers LEFT JOIN email_sent_history ON opt_in_subscribers.user_id = email_sent_history.user_id AND sent_date BETWEEN start_date AND end_date WHERE email_sent_history.user_id IS NULL ), -- every user's latest email sent date users_last_sent_date AS ( SELECT user_id, MAX(date) AS last_sent_date, FROM `email_sent_history` WHERE delivery_status = "Y" GROUP BY user_id ) -- INSERT an Opt-Out record in history for these SELECT last_sent_date AS date, user_id, "O" AS status FROM in_but_not_sent JOIN users_last_sent_date USING(user_id)
So here’s what we accomplished today, we started by identifying some inconsistencies in our email subscriber history table that we’d built. Where the list of users who are subscribed to my mailing list didn’t align with the list of users that were sent a promotional email. To combat this, we start by comparing the two data sources(subscriber & email sent history) to identify users that were subscribed but not sent an email and vice versa. Then we design an approach for each of these cases to fix our original table by accounting for these inconsistencies. And voila by tweaking our subscriber_feed_history it’s now aligned with the real truth. Below are some key takeaways.
Photo by LinkedIn Sales Solutions(Left) & Ben White(Right) on Unsplash
Let me know how you’d approach it. If you found this helpful share it. If you’re into this, you can find me on Twitter @abhishek27297 where I talk data.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.