Mixpanel At Scale

(Caveat: This talk primarily covers Mixpanel but could apply to any similar tracking and/or CRM providers)

Mixpanel is a great tool: it is very easy to set up and understand, and you can quickly get timely information on what your users are doing. It gives you device and browser segmentation of any event out of the box, as well as a basic CRM with per-user event playback and ad-hoc email targeting.

Once you have used it in production for a while, you may find yourself coming up against one or more of the following problems:

1) Old Events that have since been renamed, repurposed or removed are still hanging around the interface that your marketing analysts use for querying. This causes confusion, and institutional knowledge is required to know the caveats.

2) You use Mixpanel SDKs across various platforms or products, and event names and properties that should be the same across all of them are not, e.g. casing differences, or ‘true’ sent as a string on one platform and as a boolean on another. This again makes it hard to query without institutional knowledge of the caveats for certain properties.

3) People properties are out of sync with the state of the users in your application database (related record counts, feature usage, emails), meaning ad-hoc syncing is required, which may not scale as you grow.

I believe this is a problem for tracking solutions which allow you to identify users at large (1), but for the purpose of this post I’ll focus on the solutions a team I worked with strove towards for Mixpanel to address these problems:

1) The Truth for Events: Lock down your Event names and properties, and your People properties, with a map in a resource that can be shared across all the platforms and devices. For example, some sort of JSON whitelist like { userSignsUp: { eventName: 'User signs up', properties: { ... } } }. As well as making it difficult to trigger an event outside of these known events, it also makes every change to Events visible to developers across the different products and platforms, meaning conflicts and problems can be addressed before the changes hit production. It may hinder releases if products have different schedules, however.
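To make the whitelist idea concrete, here is a minimal sketch in Python. The `EVENTS` map, the `build_event` helper and the `plan` and `invited` properties are hypothetical names chosen for illustration; in practice the map would probably be loaded from the shared JSON resource, and this is not how our team's version was actually written.

```python
from typing import Any

# Hypothetical shared whitelist. In a real setup this could be loaded from a
# JSON file that every platform consumes; the property names are illustrative.
EVENTS: dict[str, dict[str, Any]] = {
    "userSignsUp": {
        "eventName": "User signs up",
        "properties": {"plan": str, "invited": bool},
    },
}


def build_event(key: str, properties: dict[str, Any]) -> dict[str, Any]:
    """Return a validated event, or raise if it is not covered by the whitelist."""
    if key not in EVENTS:
        raise KeyError(f"Unknown event key: {key!r}")
    spec = EVENTS[key]
    allowed = spec["properties"]
    for name, value in properties.items():
        if name not in allowed:
            raise ValueError(f"Unknown property {name!r} for event {key!r}")
        # An exact type check rejects the string 'true' where a boolean is
        # expected, which is one of the inconsistencies described above.
        if type(value) is not allowed[name]:
            raise TypeError(
                f"Property {name!r} must be {allowed[name].__name__}, "
                f"got {type(value).__name__}"
            )
    return {"event": spec["eventName"], "properties": dict(properties)}
```

Callers would pass the result of `build_event("userSignsUp", {...})` to whichever SDK they use, so a misspelled event key or a mistyped property fails during development rather than appearing in Mixpanel.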

2) State for People: Make Mixpanel People properties a read-only copy of current application state. This is general good practice for data syncing, but it is easy to get wrong if you haven’t run into these problems before, as I hadn’t. Your marketing team wants to use Mixpanel as a CRM to contact and segment users for their mail-outs and experiments, so they want the properties of the users they wish to segment to be up to date. You can set these in the client-side SDKs, but you end up having to reimplement the same properties multiple times to make sure they are set regardless of which platform the user signed up on, and on a browser the user could be using an adblocker that blocks Mixpanel in its entirety!

The solution we strove for was to sync any changes to the user state in the server-side model (in our case a Rails model called User) to Mixpanel as well. This means that as soon as the user is created in your application, they will appear in Mixpanel too. The same applies to future changes, such as counts, or state such as subscriptions. This meant the data only flowed one way for the most part: Changes to application User state -> Enqueue Mixpanel background worker -> Mixpanel People property changes. The only exception is properties specific to each product, such as ‘usedIOSApp’ or ‘usedChromePlugin’, which can’t be avoided, but they won’t suffer from the same duplication issues.
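The one-way flow above can be sketched as follows. This is an illustration in Python rather than our Rails code: the `User` fields, `people_properties`, `on_user_saved` and `run_worker` are all hypothetical names, an in-memory queue stands in for a real background job system, and the `send` callable stands in for whatever Mixpanel client you use.

```python
import queue
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class User:
    uuid: str
    email: str
    subscription: str = "free"
    documents_count: int = 0


def people_properties(user: User) -> dict[str, Any]:
    """The single place that maps application state to People properties."""
    return {
        "email": user.email,
        "subscription": user.subscription,
        "documentsCount": user.documents_count,
    }


# Stand-in for a real job queue.
sync_queue: "queue.Queue[tuple[str, dict[str, Any]]]" = queue.Queue()


def on_user_saved(user: User) -> None:
    """Call from the model's after-save hook: enqueue, never call Mixpanel inline."""
    sync_queue.put((user.uuid, people_properties(user)))


def run_worker(send: Callable[[str, dict[str, Any]], None]) -> int:
    """Drain the queue, handing each snapshot to `send`; return the number sent."""
    processed = 0
    while True:
        try:
            distinct_id, props = sync_queue.get_nowait()
        except queue.Empty:
            return processed
        send(distinct_id, props)
        processed += 1
```

Because every platform's changes end up in the same server-side model, the property mapping lives in one function instead of being reimplemented in each client SDK.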

3) Identify using a unique identifier that will never change: Make sure to use something like a user UUID to alias and identify the user across all devices from the beginning, rather than any value which seems unique but may change. The documentation actually mentions using ‘email’ at time of writing, which means that from that point on the user will not be able to change their email address! We ended up using the UUID for internal events as well, making it easier to tie together application state, Mixpanel events and our own events if the need arose.
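As a small illustration of the point, the sketch below (Python, with a hypothetical `User` class and `distinct_id` helper) assigns the identifier once at creation, so editing the email address never changes the value used to identify the user.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class User:
    email: str
    # Generated once when the record is created and never reassigned.
    user_uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


def distinct_id(user: User) -> str:
    """The value every platform passes when identifying or aliasing the user."""
    return user.user_uuid
```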

4) Scripts for cleaning up old data: This is not something I ever implemented, but I believe it ends up being very useful as you iterate on your Mixpanel usage. Mixpanel’s GUI doesn’t make it easy to bulk delete Events or People properties. However, they do have a powerful API, so scripting isn’t too difficult.
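Since I never built this, the following is only a sketch of what the core of such a script might look like in Python. `ALLOWED_PEOPLE_PROPERTIES`, `stale_properties` and `unset_payload` are hypothetical names, and the payload shape (a `$unset` list alongside `$token` and `$distinct_id`) is my understanding of Mixpanel's profile-update format, so check it against the current API reference before relying on it. Fetching profiles and sending the requests are deliberately left out.

```python
from typing import Any, Iterable

# Hypothetical whitelist of People properties that should still exist.
ALLOWED_PEOPLE_PROPERTIES = {"subscription", "documentsCount"}


def stale_properties(profile_properties: Iterable[str]) -> list[str]:
    """Custom properties on a profile that are no longer in the whitelist.

    Names starting with '$' are treated as Mixpanel-reserved and left alone.
    """
    return sorted(
        name
        for name in profile_properties
        if not name.startswith("$") and name not in ALLOWED_PEOPLE_PROPERTIES
    )


def unset_payload(token: str, distinct_id: str, names: list[str]) -> dict[str, Any]:
    """Build one profile update that removes the given properties (assumed shape)."""
    return {"$token": token, "$distinct_id": distinct_id, "$unset": list(names)}
```

A real script would loop over exported profiles, skip those where `stale_properties` returns an empty list, and send the rest in batches.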

5) Server-Side Changes: Be aware of IP and Time: This caught us out in a serious way. When you change People properties on the server side, make sure to set $ip: 0 and $ignore_time: true on the request, as otherwise Mixpanel will use the server’s IP and the time of the request by default. Setting $ip to 0 stops the user’s IP (and anything derived from it) being overwritten with your server’s, and $ignore_time stops the update from changing their Last Seen value, which Marketing might use to target users to return.
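A sketch of what such a server-side update might look like, in Python. The `people_set_payload` helper is a hypothetical name, and placing `$ip` and `$ignore_time` at the top level of the update next to `$set` is my understanding of the request format; most official server-side libraries expose the same two flags through their own options, so check the one you use.

```python
import json
from typing import Any


def people_set_payload(
    token: str, distinct_id: str, properties: dict[str, Any]
) -> dict[str, Any]:
    """Build a People update that leaves the user's IP and Last Seen untouched."""
    return {
        "$token": token,
        "$distinct_id": distinct_id,
        "$set": dict(properties),
        # Do not attribute the server's IP address to the user.
        "$ip": 0,
        # Do not treat this background update as user activity.
        "$ignore_time": True,
    }


if __name__ == "__main__":
    # Print an example payload; the token and ID are placeholders.
    example = people_set_payload("YOUR_TOKEN", "u-1", {"subscription": "pro"})
    print(json.dumps(example, indent=2))
```

Building the payload in one helper means no background worker can forget the two flags.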

Hopefully this is a useful guide to using Mixpanel at a larger scale. If you have any questions, please leave them in the comments.

Footnotes:

1) Fullstory.com provide an interesting solution. They capture every interaction on the page and don’t actually allow you to ‘track’ your own events. Instead, they let you find events by pages visited, elements clicked or changed, and user properties, all in one search. Besides lowering implementation costs, this is particularly useful because it lets you come up with new queries on the fly without having to add new events, and you can apply your funnels retroactively rather than hoping to think of everything upfront.
