ACTIVE Network API Developer Blog

The Prediction API: A new Data Science service

The Asset Service group here at ACTIVE builds and maintains a collection of services which ingest, digest, and disseminate all the data used to produce the events and their details that you find on our site. These "assets" come to our group with varying levels of completion: sometimes a plump data document with good descriptions, useful labels, and lots of other objects, but more often just a name, short summary, and pricing. We allow this nearly free-form submission by design to enable flexibility at the source, but this makes things somewhat tricky when serving them up for easy searching.

To bring the quality of the data to a form suitable for feeding the comprehensive list of world-wide events and activities offered by The Active Network as part of our technology solutions products, we created the Asset Service. Within it is an automated asset processing workflow system that utilizes (among other things) machine learning algorithms and text mining techniques to assess, clean, classify, and generally improve the original submission. Plugging extensible data science into a large workflow that is core to our technology, and which requires high accuracy and availability, isn't necessarily simple, but the end results can still be quite elegant. We recently launched a new Prediction API for use in our automated asset processing workflow. It illustrates some of the challenges this presents and serves as a nice piece of data science infrastructure upon which we will be building. I’d like to share with you these challenges and the implementation we chose for this particular API as an introduction to data science here at ACTIVE, including a walk-through to build and serve a machine learning model for classification.

First, a little background.

Asset Processing Overview

The automated asset processing workflow uses an intelligent combination of data science and business logic in the form of self-contained workflow elements. These enable the most efficient data improvement in the shortest amount of time, in concert with ongoing evaluation against statistically determined thresholds for applying changes or signaling next steps. It is all fed with self-evaluating metrics to dictate future enhancements to the workflow process. Every step includes various measures culminating in a pass or fail threshold decision, from which subsequent steps are taken. Each step also feeds a number of real-time accessors to the data (e.g. MongoDB, our Recommendation API, ElasticSearch), allowing both immediate availability and near-term improved data. The system ensures proper naming, categorizing, geolocation, and search optimization of each event, while breaking the data down into independent sub-components, each of which can be enhanced and accessed separately (e.g. Places, Organizations, Topics, etc.). Each step also allows for a final evaluation of the need for human intervention, in which case it flags and presents the data piece as a task to specialized event researchers for manual improvement via a web-based GUI. Subsequent re-submissions of these events are detected upon ingestion, de-duplicated, and smartly re-assigned the prior final changes depending on what data is new, so that improved data goes live even faster on update operations.

Asset Topics

The main taxonomy of assets is based on descriptive topics. These range from very general at the top level ("Endurance", "Health") to very specific ("Trail running", "Vegetarian cooking"), and aid in indexing events for fast searching and recommendation. Topics are rarely included in the data submitted to us (though we do encourage it). If they are, they are typically too general or inaccurate (perhaps designed to boost search with better-performing but unrelated topics). Thus, we have to automatically classify each asset with a proper topic or set of topics as part of asset processing. We do this in the typical fashion, using text mining and predictive classification algorithms, including support vector machines, probabilistic learning algorithms, unsupervised clustering, and tree-based methods. In subsequent posts, I’ll talk in more detail about how we accomplish this. Standing up an API to make use of the predictive models we build, with the speed and availability needed for in-lining to our workflow, presents unique requirements apart from other types of service APIs.

Prediction API

The new Prediction API (code-named Sibyl) is currently slated for internal use, but designed for expansion outside the realm of asset processing.  Obviously, we want the API to offer reliable, accurate, and fast prediction of topics when given an asset or text data.  We also want it to offer a pluggable set of methods repeatable for predicting other useful label-type data, such as categories (“class”, “race”, “conference”, etc.), meta-interests (“mud run”, “kids”, “military”, etc.), attributes (“High School”, “10k”, “Beginner”, etc.), and tags (user-derived).  Finally, the prediction technique needed to result in something we could slot right into a nice web service that we could simply call, like a geolocation service.

Right tools for the job

I’m a firm believer that groups need to standardize languages, stacks, and technologies in general. However, sometimes a project requires a foreign technology to be done right. At ACTIVE, we recognize the business need for tech consolidation, but we also realize that any project that does not seem to fit the current stack also arose from a business need. Deciding whether to pursue the new tech is a series of weighing exercises.

Tech Requirements

We had several technology and usage needs going into this.  These covered the usual things like database connectivity and web server tools, but there are several additional needs.
First, of course, we need the data processing and algorithms.
1)   A language and/or set of libraries that:
  • → Can process 200k or more documents rapidly (in seconds) for vectorization (n-gram counts, tf/idf, scaling, SVD/PCA, hashing, etc.)
  • → Offers lots of well-maintained and vetted algorithms for classification (SVM and linear SVM, self-organizing maps and other neural network methods, Bayes and other statistical and probabilistic learning techniques, tree-based methods, etc.)
  • → Provides measures of performance, parameter sweeping, and lots of stats
  • → Enables multiclass (as opposed to binary) and multilabel (“green” and “fast” as opposed to “green” or “fast”) prediction options
  • → Amenable to a prediction ensemble setup (combining different algorithms)
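
As a rough illustration of the last two bullets, here is a minimal sketch of a soft-voting ensemble that can return either one label (multiclass) or several (multilabel). The function names, the three "models", and their probabilities are all made up for this example and are not part of the Prediction API; the code is written for Python 3 with only the standard library.

```python
def average_probabilities(predictions, weights=None):
    """Combine per-label probabilities from several models (soft voting)."""
    if weights is None:
        weights = [1.0] * len(predictions)
    total = float(sum(weights))
    combined = {}
    for probs, weight in zip(predictions, weights):
        for label, p in probs.items():
            combined[label] = combined.get(label, 0.0) + weight * p / total
    return combined

def pick_labels(combined, threshold=None):
    """No threshold: the single best label. With one: every label at or above it."""
    if threshold is None:
        return [max(combined, key=combined.get)]
    return sorted(label for label, p in combined.items() if p >= threshold)

# made-up outputs from three hypothetical models for one document
svm = {"green": 0.7, "fast": 0.6, "slow": 0.1}
bayes = {"green": 0.5, "fast": 0.8, "slow": 0.3}
tree = {"green": 0.9, "fast": 0.4, "slow": 0.2}

combined = average_probabilities([svm, bayes, tree])
print(pick_labels(combined))        # multiclass: one label
print(pick_labels(combined, 0.5))   # multilabel: "green" and "fast"
```

Averaging probabilities is only one way to combine models; majority voting or stacking are common alternatives.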
Along with the data science, we need to handle linear algebra (matrix math) and multithreading.
2)   Mathematics and performance needs
  • → Sparse matrices support
    • › 150k assets with 20k features = 3x10^9 matrix members
  • → Optimized parallel processing (threads and processes)
    • › We need models in minutes and predictions in milliseconds
To serve up predictions as an API, we need TCP capabilities.
3)   We wanted the same language for the API web server
  • → Fast to support inclusion of prediction in the asset processing workflow, or any other place
  • → Support for proxy and reverse proxy (like nginx)
  • → Security
  • → JSON support
  • → Pre-forking and event loop support
To account for data drift, changing consumer and organizer trends, product shifts, etc., we need the system to self-correct (mostly).
4)   We want to be able to retrain and deploy new models effortlessly:
  • → As new data comes in
    • › reach a threshold
    • › rebuild models
    • › self-assess
    • › plug and deploy or rebuild with new parameters
  • → As issues arise
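
The retraining steps above can be sketched as a small control function. This is an assumed outline, not our production code: `retrain_cycle`, `build`, `assess`, and the scores are hypothetical stand-ins, written for Python 3 with only the standard library.

```python
def retrain_cycle(new_docs, threshold, build, assess, min_score, param_grid):
    """Return (model, params, score) to deploy, or None to keep the current model."""
    if len(new_docs) < threshold:
        return None                       # not enough new data yet
    for params in param_grid:             # rebuild, trying new parameters in turn
        model = build(new_docs, params)
        score = assess(model)             # self-assess, e.g. on held-out data
        if score >= min_score:            # good enough: plug and deploy
            return model, params, score
    return None                           # nothing passed; keep the deployed model

# stand-ins for real training and scoring, for illustration only
def build(docs, params):
    return {"alpha": params["alpha"], "n_docs": len(docs)}

def assess(model):
    return 0.95 if model["alpha"] <= 0.01 else 0.80

grid = [{"alpha": 1.0}, {"alpha": 0.1}, {"alpha": 0.01}]
result = retrain_cycle(["doc"] * 500, 100, build, assess, 0.90, grid)
print(result)
```

Returning `None` when no candidate passes keeps a poor rebuild from replacing a model that is already serving predictions.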
Once built, we didn’t want to have to build something else for the next data science project, so it needs to be extendable for the foreseeable future.
5)   Support for future work with the same train-test-deploy workflow
  • → Unsupervised clustering
  • → similarity algorithms for recommendations
  • → custom classification and clustering algorithm development
We also needed the usual suspects.
6)   Database and queue connectivity (Oracle, MongoDB, queues like beanstalkd and MQ implementations, etc.)
7)   Monitoring ability (monit, New Relic, etc.)
8)   Easily understood and maintained
  • → Devs without a data science background should be able to:
    • › plug in new DS-derived functions
    • › create new web server routes and API endpoints
    • › adjust API output
    • › add and apply basic stats and counting
    • › code review
  • → Data scientist devs should be able to extend it easily
So, what languages or libraries offer this?


Mahout and Weka are the go-to guys for machine learning like this in Java.  Our asset processing workflow is a combination of Java, Ruby, node.js, and C, so Java would work for us.  However, we didn’t have the support to build out a Hadoop cluster for which Mahout works at its best, many classification algorithms in Mahout are not yet implemented or awaiting patches, Weka is a bit disorganized, and both are far slower than our needs.  We really couldn’t see putting either of these as a backbone in a production environment.


Our core technology is Ruby-based and Ruby is great for all but the math and machine learning. We could use the libsvm bindings, write our own with manual linear algebra manipulations, or extend some of the existing, unmaintained ones. These options were either far too restrictive or required too much reinvention. As interest in data science grows, and as APIs like this become more mainstream, Ruby will probably grow a few really good gems to handle all of this.


Clearly systems like R and MATLAB have excellent math functionality and machine learning packages.  However, the web and server offerings are quite limited and enterprise-level solutions are far too expensive.


We could easily build things in R or MATLAB rapidly, export the models and translate them into functions to run prediction in another language (or run queues and daemons for this), and serve up the results in yet another language. However, too many moving parts and different languages become a nightmare with version changes, language updates, and changing requirements.


Python has an enormous scientific community, flush with well-maintained libraries for advanced mathematics, statistics, signal processing, and machine learning (see the SciPy stack).  The mathematics are backed by high-performance C libraries (like ATLAS) and other C-level access is easy. Multithreading and multiprocessing are well-established (and getting better) in Python, and it supports running Java or C code if needed.  Web servers in Python are as easy and (nearly) as performant as they are in node.js.  So, to meet all the required and desired features, we went with Python.
Following are some simplified examples of how easy machine learning as a service in Python can be.  Of course, our full Prediction API is more complicated, but all you devs should be able to see how you can build a full-blown service around this little example.*
In Python, there is a package called scikit-learn (imported as sklearn) which covers most of the machine learning we need to do, along with most of the data prep and model assessment. The first step is to get your text data into a vector that can be used in predicting. For example, let's assume you have text documents for events with a description field (you can snag a ton of events with our Activity Search API or other Activity APIs), that each event has a topic, and that they are stored one JSON document per line in a text file "topics_events.json". Here is an excerpt:
{ "event": { "topic": "Hip Hop dance", "text": "Pre-School Tumbling/ Hip Hop    " } }
{ "event": { "topic": "Distance running", "text": "Santito Youth Talent Ministry Mile Run/Walk    " } }
{ "event": { "topic": "Yoga", "text": "Pre/Post-Natal Yoga M Yoga is an ideal form of exercise before, during, and after pregnancy, and is safe and nurturing, Maintain strength and flexibility, combat fatigue, swelling, back ache and nausea, calm nerves and increase relaxation while reducing common discomforts. Moms and babies welcome. Instructor: Dana Chamblin.  " } }
{ "event": { "topic": "Dance", "text": "Cardio Line Dancing at Haines This activity takes line dancing to a whole new level. Get a cardiovascular workout and learn a variety of moves and experience many genres of music.  " } }
{ "event": { "topic": "Distance running", "text": "The 32nd Annual Skunk Cabbage Classic Run Preregistration: $20, must be postmarked by Friday, February 15, 2013; $25.00 from February16, 2013-April 8th, 2013. Race day registration $35 until 9:45 a.m. race day.  " } }
{ "event": { "topic": "Photography", "text": "Photography Class with Ron St. Germain Whether your camera is an old one from the closet or the newest technology, this class will familiarize you with all of its buttons and functions. You will learn the basics in a fun and easily understood way with entertaining slide presentations and plenty of time to ask questions each week from 5-time International Award winning outdoor photographer, Ron St. Germain. For detailed information, check his website at .  " } }
{ "event": { "topic": "Creative writing", "text": "Memoir Writing (6/27-7/18)    Memoirs are your memories. Learn how to convert your memories into interesting stories to pass down to future generations. Participants will learn how to connect with the great, great, great grandchildren that they will never meet and show them what their lives were like.   " } }
{ "event": { "topic": "Yoga", "text": " Fall Exercise, 01a Mommy   Me (Mon)    " } }
First, we read it into a manageable object while cleaning text and separating out the labels (topics for this example).
In [1]:
import json, re

def clean(txt):
    #clean your data (strip tags, remove number-only words, etc.)
    return txt

def loadCorpus(data_file):
    labels = []
    texts = []
    #load your data
    events = open(data_file).readlines()
    cnt = 0
    processed = 0
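    for d in xrange(len(events)):
        cnt += 1
        event = re.sub(r'(\n|\r)+','',events[d].strip())
        try:
            json_event = json.loads(event, 'utf-8')
            event = json_event["event"]
            # keep the label and the cleaned text for this event
            labels.append(event["topic"])
            texts.append(clean(event["text"]))
            processed += 1
        except Exception, e:
            # skip lines that fail to parse or lack a topic or text
            continue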
    print str(processed) + " events processed out of " + str(cnt) + " (" + str(float(processed)/float(cnt)) + ")!"
    return labels,texts

labels_train,data_train = loadCorpus("topics_events.json")
116717 events processed out of 120475 (0.968806806391)!
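
The `clean` function above is left as a stub. A minimal sketch of the two steps its comment mentions (stripping tags and removing number-only words) might look like the following. The regular expressions are assumptions chosen for illustration, not the rules our workflow uses, and the code is written for Python 3.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")          # crude HTML/XML tag matcher
NUMBER_ONLY_RE = re.compile(r"^\d+$")    # words made of digits only

def clean(txt):
    """Strip tags, drop number-only words, and collapse whitespace."""
    txt = TAG_RE.sub(" ", txt)
    words = [w for w in txt.split() if not NUMBER_ONLY_RE.match(w)]
    return " ".join(words)

print(clean("<p>The 32nd Annual  Skunk Cabbage Classic Run 2013</p>"))
```

Words such as "32nd" or "10k" survive because they are not purely numeric, which matters when distances and ordinals carry meaning for the topic.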

Now let's index the labels and split out chunks for model training and model testing:
In [2]:
import numpy as np

# index the topics, getting unique entries with Python's set object
unique_topics = list(set(labels_train))
labels_train = np.array(labels_train)
labels = np.empty(shape=labels_train.shape)
for c in xrange(len(unique_topics)):
    # use numpy's where function to find and index topics
    labels[np.where(labels_train == unique_topics[c])] = c

X_train = data_train
# set aside test data using about 5% of the full data set
data_len = len(X_train)
tsn = int(data_len*0.05)
# generate a random set of unique indexes to pluck out for testing
# (sampling without replacement, so no document is picked twice)
test_samp = np.random.permutation(data_len)[:tsn]
# use Python's list comprehension to pull out test data and labels
X_test = [ X_train[i] for i in test_samp ]
Y_test = [ labels[i] for i in test_samp ]
# remove the test data from the training data
Y_train = list(labels)
for s in sorted(test_samp,reverse=True):
    del X_train[s]
    del Y_train[s]

print str(len(X_train)) + " training docs, " + str(len(X_test)) + " testing docs"
print "Topics:"
[str(ut) for ut in unique_topics]
110882 training docs, 5835 testing docs
 'American football',
 'Aquatic sports',
 'Ballroom dance',
 'Cake decorating',
 'Card games',
 'Child care',
 'Creative writing',
 'Cross country running',
 'Cross country skiing',
 'Digital photography',
 'Distance running',
 'Drawing and drafting',
 'Field hockey',
 'Figure skating',
 'First aid and CPR',
 'Flag football',
 'French horn',
 'Hip Hop dance',
 'Ice hockey',
 'Ice skating',
 'Jazz dance',
 'Jewelry making',
 'Mountain biking',
 'Mud running',
 'Percussion and Drumming',
 'Pottery and ceramics',
 'Reading and writing',
 'Sexual health',
 'Strength training',
 'Table tennis',
 'Tae Kwon Do',
 'Tai chi',
 'Tap dance',
 'Tending animals',
 'Trail running',
 'Voice and singing',
 'Wood work',
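
One detail worth checking in any hold-out split is that the test indexes are unique. If they are drawn with replacement, the same document can be selected twice, and deleting by index can then remove the wrong training document. A minimal standard-library sketch of a split without overlap follows; `holdout_split` and the toy data are hypothetical, and the code is written for Python 3.

```python
import random

def holdout_split(docs, labels, test_fraction=0.05, seed=None):
    """Return (train_docs, train_labels, test_docs, test_labels) with no overlap."""
    rng = random.Random(seed)
    n_test = int(len(docs) * test_fraction)
    # sample() draws without replacement, so every index is unique
    test_idx = set(rng.sample(range(len(docs)), n_test))
    train_docs = [d for i, d in enumerate(docs) if i not in test_idx]
    train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
    test_docs = [docs[i] for i in sorted(test_idx)]
    test_labels = [labels[i] for i in sorted(test_idx)]
    return train_docs, train_labels, test_docs, test_labels

docs = ["doc %d" % i for i in range(100)]
labels = [i % 3 for i in range(100)]
train_docs, train_labels, test_docs, test_labels = holdout_split(docs, labels, 0.05, seed=42)
print(len(train_docs), len(test_docs))
```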

Then we need to convert our text data into some measurable values that the predictive algorithms can use ("vectorize" the data), and that we think might be predictive. For this example, we'll tokenize the text with sklearn's CountVectorizer into n-grams of 1-, 2-, and 3-token lengths (ngram_range) and only take tokens appearing in 2 or more documents ("document frequency", min_df):
In [3]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english',charset_error='ignore',ngram_range=(1,3),min_df=2) 
X_train = vectorizer.fit_transform(X_train)
<110882x479646 sparse matrix of type '<type 'numpy.int64'>'
	with 5200668 stored elements in COOrdinate format>
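
To make ngram_range and min_df more concrete, here is a small standard-library sketch of the same idea. It is a simplification, not how CountVectorizer is implemented: it splits on whitespace only and skips stop-word removal, and the function names and three toy documents are made up. The code is written for Python 3.

```python
from collections import Counter

def ngrams(text, ngram_range=(1, 3)):
    """Yield word n-grams for every length in ngram_range (inclusive)."""
    tokens = text.lower().split()
    lo, hi = ngram_range
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def build_vocabulary(docs, ngram_range=(1, 3), min_df=2):
    """Map each n-gram found in at least min_df documents to a column index."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc, ngram_range)))   # count each doc once
    kept = sorted(t for t, df in doc_freq.items() if df >= min_df)
    return {term: idx for idx, term in enumerate(kept)}

def vectorize(doc, vocabulary, ngram_range=(1, 3)):
    """Sparse count vector as {column index: count}."""
    counts = Counter(t for t in ngrams(doc, ngram_range) if t in vocabulary)
    return {vocabulary[t]: c for t, c in counts.items()}

docs = ["trail run in the park", "mud run in the rain", "yoga in the park"]
vocab = build_vocabulary(docs)
print(sorted(vocab))
print(vectorize("run in the park", vocab))
```

Terms such as "trail" or "mud run" appear in only one document, so min_df=2 drops them, which is how the vocabulary stays manageable on a large corpus.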
In [4]:
X_test = vectorizer.transform(X_test)
<5835x479646 sparse matrix of type '<type 'numpy.int64'>'
	with 247403 stored elements in COOrdinate format>
Note that these are now sparse matrices. If you compare the shape of the matrix with the number of elements actually stored, you can understand how useful this type of data representation is.
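
A quick back-of-the-envelope calculation shows the difference, using the shape and stored-element count reported for X_train above. The 4-byte index width is an assumption about how a coordinate-format matrix stores its row and column indexes, so treat the sparse figure as an estimate. The code is written for Python 3.

```python
rows, cols, stored = 110882, 479646, 5200668   # from the X_train output above

cells = rows * cols
density = stored / cells

BYTES_PER_VALUE = 8   # int64 counts, as reported above
BYTES_PER_INDEX = 4   # assumption: 32-bit row and column indexes

dense_bytes = cells * BYTES_PER_VALUE
# coordinate format keeps one value plus a row and a column index per entry
coo_bytes = stored * (BYTES_PER_VALUE + 2 * BYTES_PER_INDEX)

print("density: %.4f%%" % (100 * density))
print("dense:   %.1f GB" % (dense_bytes / 1e9))
print("sparse:  %.1f MB" % (coo_bytes / 1e6))
```

Under these assumptions the dense form would need hundreds of gigabytes, while the sparse form fits in well under a hundred megabytes.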

Now we can build a predictive model. We'll just use the probabilistic Naive Bayes method, which works well for this type of data, with sklearn's MultinomialNB predictor:
In [5]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB(alpha=.01)
# fit on the vectorized training documents and their topic indexes
model.fit(X_train, Y_train)
MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)
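
For readers curious about what a multinomial Naive Bayes model computes, here is a minimal standard-library sketch with additive smoothing (the role alpha plays above). It illustrates the technique and is not sklearn's implementation; `train_nb`, `predict_nb`, and the four toy documents are hypothetical. The code is written for Python 3.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=0.01):
    """Fit multinomial Naive Bayes on tokenized docs with additive smoothing."""
    vocab = sorted({tok for doc in docs for tok in doc})
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label in class_docs:
        total = sum(token_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(class_docs[label] / len(docs)),
            "log_likelihood": {
                t: math.log((token_counts[label][t] + alpha) / total)
                for t in vocab
            },
        }
    return model

def predict_nb(model, doc):
    """Return the label with the highest log posterior; unseen tokens are ignored."""
    scores = {}
    for label, params in model.items():
        scores[label] = params["log_prior"] + sum(
            params["log_likelihood"][t] for t in doc
            if t in params["log_likelihood"])
    return max(scores, key=scores.get)

docs = [["trail", "run"], ["mud", "run"], ["yoga", "class"], ["yoga", "stretch"]]
labels = ["Running", "Running", "Yoga", "Yoga"]
nb = train_nb(docs, labels)
print(predict_nb(nb, ["morning", "run"]))
print(predict_nb(nb, ["yoga"]))
```

A small alpha keeps unseen words from zeroing out a class while letting observed counts dominate.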
Save off the built model so we don't have to repeat the build if we want to do more things with it:
In [6]:
from sklearn.externals import joblib
joblib.dump(model, "models/topics.MultinomialNB.pkl")
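
A serving process needs more than the classifier. It also needs the fitted vectorizer and the topic index used at training time, otherwise the predicted indexes cannot be turned back into topic names. A minimal sketch of bundling these pieces follows, using the standard-library pickle module in place of joblib. The bundle layout, file name, and stand-in values are assumptions for illustration, and the code is written for Python 3.

```python
import os
import pickle
import tempfile

bundle = {
    "model": {"alpha": 0.01},                 # stand-in for the fitted classifier
    "vocabulary": {"run": 0, "yoga": 1},      # stand-in for the fitted vectorizer
    "topics": ["Distance running", "Yoga"],   # index -> topic name
}

path = os.path.join(tempfile.mkdtemp(), "topics.bundle.pkl")
with open(path, "wb") as fh:
    pickle.dump(bundle, fh, protocol=pickle.HIGHEST_PROTOCOL)

# later, in the serving process
with open(path, "rb") as fh:
    restored = pickle.load(fh)

print(restored["topics"][1])
```

Only load pickled files you created yourself, since unpickling can execute arbitrary code.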
Now we can do some prediction on the test set, which the model has never seen, and output the index of the best predicted topic for each entry in the test data:
In [7]:
# do the prediction
pred = model.predict(X_test)
print pred
[ 64.   4.  42. ...,  43.  58.  26.]

Finally, we can review the performance of the model:
In [8]:
# get some performance metrics
from sklearn import metrics

# scores per class, output not printed here to save space
#print  metrics.f1_score(Y_test, pred, average=None)
#print  metrics.recall_score(Y_test, pred, average=None) 
#print  metrics.precision_score(Y_test, pred, average=None) 

# overall scores
print  metrics.f1_score(Y_test, pred)
print  metrics.accuracy_score(Y_test,pred)

# performance by topic
print  metrics.classification_report(Y_test, pred,target_names=unique_topics)
print "Confusion Matrix:"
print  metrics.confusion_matrix(Y_test, pred)
                         precision    recall  f1-score   support

                 Acting       0.97      1.00      0.99        36
               Aerobics       0.77      0.73      0.75        37
                 Aikido       1.00      1.00      1.00         5
      American football       1.00      1.00      1.00        13
         Aquatic sports       0.65      0.76      0.70       186
                Archery       0.97      1.00      0.99        37
              Badminton       1.00      0.93      0.97        15
                 Ballet       0.84      0.97      0.90       301
         Ballroom dance       0.80      0.94      0.86        47
                Bassoon       0.25      1.00      0.40         2
                Bowling       0.94      0.94      0.94        18
                 Boxing       1.00      0.86      0.92         7
                 Bridge       0.90      1.00      0.95         9
                    CPR       0.71      0.79      0.75        19
        Cake decorating       0.89      0.67      0.76        12
             Card games       1.00      1.00      1.00         2
                  Cello       0.72      1.00      0.84        18
                  Chess       0.92      1.00      0.96        12
             Child care       0.96      0.90      0.93       156
                Chinese       0.93      0.87      0.90        15
               Clarinet       0.86      1.00      0.92         6
       Creative writing       0.85      1.00      0.92        28
  Cross country running       0.83      0.83      0.83         6
   Cross country skiing       0.62      1.00      0.77         5
                  Dance       0.90      0.75      0.82       425
    Digital photography       0.76      0.93      0.84        14
       Distance running       0.82      0.99      0.90        97
                 Diving       0.92      1.00      0.96        11
   Drawing and drafting       0.94      0.93      0.94       105
                Driving       1.00      1.00      1.00        15
                Fencing       0.83      1.00      0.91        10
           Field hockey       0.93      0.93      0.93        15
         Figure skating       0.88      0.88      0.88         8
      First aid and CPR       0.85      0.88      0.87        33
          Flag football       1.00      0.92      0.96        38
                  Flute       0.62      1.00      0.76        13
                 French       0.67      1.00      0.80         6
            French horn       0.71      0.83      0.77         6
              Gardening       1.00      0.83      0.91        12
                 Guitar       0.88      0.90      0.89        87
          Hip Hop dance       0.83      0.92      0.87        78
             Ice hockey       1.00      0.82      0.90        11
            Ice skating       0.98      0.98      0.98        88
                 Improv       0.87      0.94      0.91        36
                Italian       1.00      1.00      1.00         7
             Jazz dance       0.71      0.79      0.75        62
         Jewelry making       0.90      1.00      0.95        18
              Jiu-jitsu       0.82      1.00      0.90         9
                   Judo       0.50      0.50      0.50         2
                 Karate       0.98      0.99      0.98        80
               Kayaking       0.91      0.83      0.87        12
             Kickboxing       1.00      1.00      1.00        11
               Knitting       0.86      0.86      0.86         7
           Lifeguarding       0.89      0.93      0.91        43
                  Magic       0.83      1.00      0.91         5
                Massage       1.00      0.89      0.94         9
        Mountain biking       0.91      0.78      0.84        27
            Mud running       1.00      0.82      0.90        11
                  Music       0.91      0.73      0.81       350
               Painting       0.93      0.93      0.93       134
Percussion and Drumming       0.90      0.90      0.90        30
            Photography       0.95      0.83      0.89        24
                  Piano       0.88      0.92      0.90       104
                Pilates       0.85      0.90      0.88        31
   Pottery and ceramics       0.96      0.94      0.95        69
               Quilting       0.94      1.00      0.97        17
    Reading and writing       1.00      0.68      0.81        22
                Sailing       1.00      1.00      1.00         7
              Saxophone       0.75      0.75      0.75        12
              Sculpture       0.71      0.92      0.80        13
           Self-defence       0.97      1.00      0.98        28
                 Sewing       1.00      1.00      1.00        26
          Sexual health       1.00      1.00      1.00         6
          Skateboarding       1.00      1.00      1.00         9
                 Skiing       0.89      0.93      0.91        42
           Snowboarding       1.00      0.89      0.94        19
            Snowshoeing       1.00      0.75      0.86        12
                Spanish       0.99      0.97      0.98       148
      Strength training       1.00      0.75      0.86        20
                Surfing       1.00      0.92      0.96        12
               Swimming       0.95      0.90      0.92      1117
           Table tennis       0.62      1.00      0.77         5
            Tae Kwon Do       1.00      0.98      0.99        42
                Tai chi       1.00      1.00      1.00        52
              Tap dance       0.77      0.85      0.81        48
                  Taxes       1.00      1.00      1.00        13
        Tending animals       0.98      0.98      0.98        51
                 Tennis       0.97      0.99      0.98       233
                Theater       0.80      0.83      0.81        29
          Trail running       0.64      0.78      0.70         9
               Trombone       0.75      1.00      0.86        15
                Trumpet       0.50      1.00      0.67         7
                   Tuba       0.82      1.00      0.90         9
               Tumbling       0.81      0.98      0.89        52
                  Viola       0.65      1.00      0.79        13
                 Violin       0.78      0.89      0.83        35
      Voice and singing       0.62      0.78      0.69        36
              Wood work       1.00      0.94      0.97        16
              Wrestling       0.84      0.94      0.89        17
                   Yoga       0.98      0.98      0.98       312
                  Zumba       0.91      0.98      0.95       186

            avg / total       0.91      0.90      0.90      5835

Confusion Matrix:
[[ 36   0   0 ...,   0   0   0]
 [  0  27   0 ...,   0   1   2]
 [  0   0   5 ...,   0   0   0]
 [  0   0   0 ...,  16   0   0]
 [  0   0   0 ...,   0 307   0]
 [  0   0   0 ...,   0   0 182]]
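
To show how the columns in the report relate to each other, here is a small standard-library sketch of precision, recall, and F1 for one class. The helper names and the five-item toy example are hypothetical. The Bassoon counts are inferred from its row above: support 2 with recall 1.00 means 2 true positives and no misses, and precision 0.25 then implies 6 false positives. The code is written for Python 3.

```python
def counts_for_class(y_true, y_pred, cls):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    return tp, fp, fn

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Bassoon row: 2 true positives, 6 false positives, 0 false negatives
print(precision_recall_f1(2, 6, 0))

# a toy example computed from label lists
y_true = ["a", "a", "b", "b", "b"]
y_pred = ["a", "b", "b", "b", "a"]
print(precision_recall_f1(*counts_for_class(y_true, y_pred, "b")))
```

The Bassoon row is a reminder that a class with perfect recall can still have a weak F1 when only a couple of test documents support it.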

To serve up prediction, all we need to do is create a server and route, start it up, and hit the endpoint. We'll use the Bottle framework for this example, but for real use you probably want to run with an event loop API like gevent and manage it with a pre-forker like gunicorn (or run it all with uWSGI) and sit it behind nginx.
In [9]:
from bottle import Bottle, run, request, abort, error,HTTPResponse, HTTPError
# initialize the server
app = Bottle()
# define a route to take some GET data
@app.route('/pred/topic/<text_data>', method='GET')
def runPrediction(text_data=''):
    pred_results = {}
    probabilities = []
    pred = None
    probs = None
    pred_results["data"] = text_data
    topic_nm = ""
    topic = None
    try:
        X_data = vectorizer.transform([text_data])
        # use class probabilities when the model provides them
        if "predict_proba" in dir(model):
            probs = model.predict_proba(X_data)[0]
            topic = np.argmax(probs)
        else:
            pred = model.predict(X_data)
            topic = int(pred[0])
        topic_nm = unique_topics[topic]
    except Exception, e:
        # error handling here is an assumption; adapt it to your needs
        abort(500, "Prediction failed: " + str(e))
    if probs is not None:
        for pr in xrange(len(probs)):
            prob = round(probs[pr],2)
            # only show if the probability is > 0
            if prob > 0:
                probabilities.append({ "topic": unique_topics[pr], "confidence": prob })
    pred_results["confidence"] = sorted(probabilities, key=lambda k: k["confidence"], reverse=True)
    # the model may not expose probabilities, so guard the lookup
    confidence = round(probs[topic],2) if probs is not None else None
    pred_results["suggestion"] = { "topic": topic_nm, "confidence": confidence }
    return json.JSONEncoder().encode(pred_results)
In [10]:
# start the server
# hit "http://localhost:3000/pred/topic/I like to run with my socks on and in the rain with a soccer ball and umbrella ok" to predict on that text
run(app, host="localhost", port=3000, debug=True)
Bottle v0.11.6 server starting up (using WSGIRefServer())...
Listening on http://localhost:3000/
Hit Ctrl-C to quit.

localhost.localdomain - - [24/Oct/2013 00:14:47] "GET /pred/topic/I%20like%20to%20run%20with%20my%20socks%20on%20and%20in%20the%20rain%20with%20a%20soccer%20ball%20and%20umbrella%20ok HTTP/1.1" 200 295

API Response:
{
	"confidence" : [{
			"topic" : "Distance running",
			"confidence" : 0.49
		}, {
			"topic" : "Mud running",
			"confidence" : 0.42
		}, {
			"topic" : "Trail running",
			"confidence" : 0.06
		}, {
			"topic" : "Tennis",
			"confidence" : 0.03
		}
	],
	"data" : "I like to run with my socks on and in the rain with a soccer ball and umbrella ok",
	"suggestion" : {
		"topic" : "Distance running",
		"confidence" : 0.49
	}
}

We output an overall suggestion, but also the confidence (probabilities) of each topic in case there is other logic you want to implement (like topic weighting, multiple topic assignment, etc.). You can see that it is torn between "Distance running" and "Mud running", which seems appropriate for running in the rain!
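
As one example of such logic, a caller could assign every topic whose confidence is close to the best one instead of only the top suggestion. The sketch below works on the response shown above; `topics_within_margin` and the 0.1 margin are hypothetical choices, not part of the API, and the code is written for Python 3.

```python
import json

def topics_within_margin(response_body, margin=0.1):
    """Return every topic whose confidence is within `margin` of the top one."""
    ranked = json.loads(response_body)["confidence"]
    if not ranked:
        return []
    top = max(item["confidence"] for item in ranked)
    return [item["topic"] for item in ranked
            if top - item["confidence"] <= margin]

# the response from the example request, rebuilt as a JSON string
response_body = json.dumps({
    "confidence": [
        {"topic": "Distance running", "confidence": 0.49},
        {"topic": "Mud running", "confidence": 0.42},
        {"topic": "Trail running", "confidence": 0.06},
        {"topic": "Tennis", "confidence": 0.03},
    ],
    "data": "I like to run with my socks on and in the rain with a soccer ball and umbrella ok",
    "suggestion": {"topic": "Distance running", "confidence": 0.49},
})

print(topics_within_margin(response_body))
```

With a margin of 0.1 this keeps both "Distance running" and "Mud running", while a margin of 0 falls back to the single top topic.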
Obviously, there are lots of details left out.  In subsequent posts, I'll go into more details for some of these things.  For now, hopefully you've seen that adding some data science means mixing in more considerations for the overall development process, but it turns out to be fairly simple with the right tools.
*Code here runs on Python 2.7.3 64-bit, with numpy 1.6.1, scipy 0.12.0, and sklearn 0.13.1. Ran/formatted with IPython 1.1.0 notebook.