ACTIVE Network API Developer Blog

The Prediction API: An Ensemble

 
Last week I posted an intro to our new Prediction API, with a simplified walk-through in Python, starting from a set of asset descriptions labeled with topics and ending up with a topic prediction web service that accepts text and outputs topics and probabilities. This week we're going to extend that example by adding a few more predictive models and a method to combine them into a stronger overall prediction that can capture variability in the algorithms and the data. Since this isn't really a math, stats, or machine learning blog, I'll give quick overviews of the algorithms and link to more details in footnotes.
 

Building a Predictive Model

There are three basic steps to building a predictive model:
  1. Extract and process features from your data. For these examples, this means pulling out n-grams from our asset descriptions and vectorizing them. I'll post more about various methods for this later.
  2. Separate out training data to cover all your labels and a portion for testing. Here, our labels are the asset topics. There are many ways to do this, but for simplicity we just set aside a random portion of all the data.
  3. Run the training data through the algorithm many times with varying parameters and predict labels with your test data, assessing the model with some statistics or metrics. Repeat until your model performs to your liking.
Once the model is built, you can save it off and load it up as needed.
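As a rough illustration of steps 1 and 2, here is a minimal sketch in plain Python (standard library only). The helper names `ngram_counts` and `split_train_test` and the toy descriptions are made up for this example; the actual walk-through relies on the sklearn vectorizer from the prior post.

```python
import random
from collections import Counter

def ngram_counts(text, max_n=3):
    """Count all n-grams of 1..max_n lowercased, whitespace-split tokens."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of rows and set aside a random portion for testing."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]

# Hypothetical labeled asset descriptions: (text, topic)
data = [
    ("tough mud run with obstacles", "Mud running"),
    ("muddy obstacle course race", "Mud running"),
    ("city marathon road race", "Distance running"),
    ("half marathon and 10k", "Distance running"),
]
features = [(ngram_counts(text), topic) for text, topic in data]
train, test = split_train_test(features)
```

With real data you would also want to check that every label still shows up in the training portion after the random split.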
 

Multinomial Naive Bayes (NB) Classifier

Naive Bayes [1] is a probabilistic model, so for every class (topic here) we have, the model gives a percent likelihood that some data should be labeled with that class. That likelihood is based on a given set of features we extract from the data. In the last post, we extracted n-grams, created a vector of n-gram counts, and used those counts as features. The inverse concept is that each feature has a particular probability of being part of a class. NB makes use of these relationships to learn the overall probabilities of each feature for each class, and predicts the class with the highest resulting score:
$$\mathrm{class\_index}(feat_1, \ldots, feat_n) = \arg\max_c \; p(C = c) \prod_{i=1}^{n} p(F_i = feat_i \mid C = c)$$


NB makes some assumptions, including that the probability of a feature for a particular class is independent of the probabilities of the other features for that or any other class. This is counter-intuitive. After all, wouldn't the co-occurrence of both "mud" and "run" be a better indicator of a "mud run", and isn't it unlikely that "mud" would occur without "run" in this context? Well, we capture this in part by taking n-grams of 1, 2, and 3 tokens, but the assumption of independence actually proves to be less of an issue than you might think. More importantly, it means we don't need to calculate the cross-correlation of all the features and that saves lots of processing time.
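To make the independence assumption concrete, here is a small sketch of a multinomial naive Bayes in plain Python. It is not what sklearn does internally, just the same idea at toy scale: one smoothed probability per feature per class, multiplied together (summed in log space) with the class prior. The function names and the four toy documents are assumptions for illustration; `alpha` plays the role of the smoothing parameter.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, alpha=0.01):
    """docs: list of (token_list, label). Collect priors and per-class feature counts."""
    class_docs = Counter()
    feat_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        feat_counts[label].update(tokens)
        vocab.update(tokens)
    total = float(sum(class_docs.values()))
    priors = {c: class_docs[c] / total for c in class_docs}
    return priors, feat_counts, vocab, alpha

def predict_nb(model, tokens):
    """Return the class maximizing log p(C=c) + sum_i log p(F_i=feat_i | C=c)."""
    priors, feat_counts, vocab, alpha = model
    best_label, best_score = None, float("-inf")
    for c, prior in priors.items():
        # smoothed denominator: all feature counts in class c plus alpha per vocab entry
        denom = sum(feat_counts[c].values()) + alpha * len(vocab)
        score = math.log(prior)
        for tok in tokens:
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((feat_counts[c][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = c, score
    return best_label

docs = [
    ("mud run obstacle".split(), "Mud running"),
    ("mud crawl climb".split(), "Mud running"),
    ("marathon road run".split(), "Distance running"),
    ("marathon 10k road".split(), "Distance running"),
]
model = train_nb(docs)
print(predict_nb(model, "mud obstacle crawl".split()))
```

Each token contributes its own term to the score regardless of which other tokens are present, which is exactly the independence assumption described above.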
 

The Support Vector Machine (SVM)

Support vector machines [2] are popular and very powerful methods for classification. The general idea is to draw a separating line between populations of data that maximizes their separation, given a set of constraints, by maximizing the "margin": the distance of that separator from the nearest training data point. The data can be mapped to higher dimensions using the "kernel trick" to achieve better separation:
[Figure: SVM separation and kernel trick]

Of course, there are many ways to define and draw a line, but the fastest is a straight line and the linear SVMs are quite fast. By altering variables in the algorithm, you can adjust how this line is drawn, favoring things like good overall separation (large margin) over classification mistakes in the training data (error rate). Adjusting this trade-off is called "regularization", and amounts to adjusting the size and complexity of how the margin is represented. You may still want the best separating line even if some data points are on the wrong side or in the space between. However, as you test lines, you may want to penalize these data points, and you may want to do that differently depending on what the error was. This is done with so-called "loss" functions. Varying regularization and loss functions affects results, but also processing time (both in training and in testing, and not necessarily in the same way).
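Here is one way to picture the regularization/loss trade-off in code. This is a sketch of a common L2-regularized formulation, not necessarily the exact objective or scaling sklearn uses; the function names and the three toy points are assumptions for illustration.

```python
def hinge(margin):
    """Zero loss once a point is beyond the margin, linear penalty otherwise."""
    return max(0.0, 1.0 - margin)

def squared_hinge(margin):
    """Same idea, but large violations are penalized more heavily."""
    return max(0.0, 1.0 - margin) ** 2

def objective(w, b, points, alpha, loss=hinge):
    """Regularized average loss: alpha/2 * ||w||^2 + mean loss over points.

    points: list of (x, y) with x a list of floats and y in {-1, +1}.
    """
    reg = 0.5 * alpha * sum(wi * wi for wi in w)
    total = 0.0
    for x, y in points:
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += loss(margin)
    return reg + total / len(points)

# The last point sits on the wrong side of the line x[0] = 0.
points = [([2.0, 0.0], 1), ([-2.0, 0.0], -1), ([0.2, 0.0], -1)]
w, b = [1.0, 0.0], 0.0
print(objective(w, b, points, alpha=0.0001))                      # hinge
print(objective(w, b, points, alpha=0.0001, loss=squared_hinge))  # squared hinge
```

The same line gets a different score depending on the loss function, and raising `alpha` shifts the balance toward a smaller weight vector (a wider margin) at the expense of training errors.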
 
Stochastic Gradient Descent (SGD) Search
While searching for a solution to a problem like a line to separate data points, a common method is to sample some spot in the data, draw a line, test the separation, and then move intelligently to another spot. Doing this repeatedly over all possible spots would eventually find the best separator, but in large data sets it would take forever. The gradient descent method takes graded steps based on the comparisons during the testing and can eventually make assumptions about spots in the data it didn't visit based on spots it did visit. However, this can still be time-consuming. Another variation, called stochastic gradient descent [3], tries to estimate areas of the data by repeatedly taking random samples of the data and descending, building up an approximation of the data and the best solution. This process can adjust itself via a learning rate variable and has proven very efficient for large data sets, compared to normal gradient descent. One nice side effect of working on sub-samples of the data is that the learning can be done online, meaning you can learn or update the classification as the data comes in.
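A toy version of the idea, assuming a hinge loss with an L2 penalty and a fixed learning rate (real implementations usually decay the learning rate over time). The function names and the four data points are made up for this sketch; the point is that each update only needs one example, which is what makes online learning possible.

```python
import random

def sgd_step(w, b, x, y, lr=0.1, alpha=0.0001):
    """One stochastic gradient step on the L2-regularized hinge loss for one example."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # the penalty term shrinks w slightly on every step
    w = [wi * (1.0 - lr * alpha) for wi in w]
    if margin < 1.0:  # this example causes loss, so nudge the line toward it
        w = [wi + lr * y * xi for wi, xi in zip(w, x)]
        b = b + lr * y
    return w, b

def fit_sgd(points, n_iter=50, seed=42):
    """Run n_iter passes over the data, visiting the examples in random order."""
    rng = random.Random(seed)
    w, b = [0.0] * len(points[0][0]), 0.0
    for _ in range(n_iter):
        order = list(points)
        rng.shuffle(order)
        for x, y in order:
            w, b = sgd_step(w, b, x, y)
    return w, b

def classify(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

points = [([2.0, 1.0], 1), ([3.0, 2.0], 1), ([-2.0, -1.0], -1), ([-3.0, -1.0], -1)]
w, b = fit_sgd(points)

# Online update: a new example arrives later and we take one more step.
w, b = sgd_step(w, b, [2.5, 1.5], 1)
```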
 
Passive-Aggressive Loss Modification
When you add functions to regularize or compute loss, you are also adding points of variability where you can decide to do more. The Passive-Aggressive [4] algorithm is one such modifier: it acts passively for data points with no loss according to some loss function, and aggressively, to varying extents, for data points causing loss, in order to modify the loss function.
[Figure: Passive-Aggressive loss modification]

This guarantees that there is some non-zero margin at each update and means it can also be performed online. This was an addition Google made to weed out search result spam (2004-2005) and reduced it by 50%.
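A sketch of a single Passive-Aggressive update for a binary label, following the PA-I rule from the reference in footnote 4 as I understand it: the step size is the hinge loss divided by the squared norm of the example, capped at `C`. The function name is an assumption for this example, and it assumes `x` is not the zero vector.

```python
def pa_update(w, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}.

    Passive: if the hinge loss is zero, the weights are returned unchanged.
    Aggressive: otherwise move w toward the example; when the cap C does not
    bind, the example ends up exactly on the margin.
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)
    if loss == 0.0:
        return list(w)
    tau = min(C, loss / sum(xi * xi for xi in x))  # step size grows with the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 2.0], 1)  # aggressive: the loss was 1.0
w2 = pa_update(w1, [1.0, 2.0], 1)  # passive: the margin is already met
```

Because each step only looks at one example, this fits the same online setting as SGD.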
 

Scoring

Each of these methods has some level of internal scoring used to compare across the data and labels. Typically, to create a classifier out of these algorithms we extract that score and call the best performer the winner. For probabilistic models this is the label with the highest probability, and for SVM this is the label furthest from the separating line (largest margin distance). This is fine for comparing multiple runs on the same algorithm and getting the prediction. However, how do these scores relate to each other, and how can they be compared across algorithms? For example, if an SVM algorithm for a particular prediction suggests "Swimming" with a distance of 6.12 and a Bayes algorithm suggests "Triathlon" with a probability of 0.85, which one should we put the most confidence in? What do we do with negative distances or zero probabilities? Unfortunately, there are several ways to do this, and none of them has emerged as the consensus.
There are also several metrics used to assess the overall performance of a classifier. The typical metrics are precision and recall, and we can combine these two in what is called the F1 score. To account for varying label representation ("support", or how many examples of a label you have in your training data), the F1 score can be weighted by the support for each label.
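For reference, here is how per-label F1 and the support-weighted F1 can be computed by hand. This is a plain-Python sketch with made-up labels; in the walk-through itself these numbers come from `sklearn.metrics.f1_score`.

```python
from collections import Counter

def per_label_f1(y_true, y_pred):
    """Return {label: (f1, support)}, where support counts the true examples of a label."""
    labels = set(y_true) | set(y_pred)
    support = Counter(y_true)
    out = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[label] = (f1, support[label])
    return out

def weighted_f1(y_true, y_pred):
    """Average the per-label F1 scores, weighting each by its support."""
    scores = per_label_f1(y_true, y_pred)
    total = float(sum(s for _, s in scores.values()))
    return sum(f1 * s for f1, s in scores.values()) / total

y_true = ["Swimming", "Swimming", "Swimming", "Triathlon"]
y_pred = ["Swimming", "Swimming", "Triathlon", "Triathlon"]
scores = per_label_f1(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
```

Here "Swimming" gets precision 1.0 and recall 2/3 (F1 of 0.8), "Triathlon" gets precision 0.5 and recall 1.0 (F1 of about 0.667), and the weighted F1 leans toward the "Swimming" score because that label has three of the four examples.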
 

Ensembles

Combining models like this is generally termed an "ensemble" method. To accomplish this in our Prediction API, we do a few things:
  1. Normalize the distances in the SVM models (convert them to a uniform set of percentages)
  2. Adjust the normalized distance by internal probabilities of being on one side or the other of the margin to help compare
  3. Adjust probabilities or distances by the per-label weighted F1 score achieved on testing data for each model
  4. Average these resulting scores over all the models to get an overall top prediction.
The "adjust" step can take several forms, and other things can be done, like calibrating the probabilities to the actual distribution of labels in the test set, or tracking distances throughout the training process for correlation normalization across all labels, etc. These are all things you can play with once you get the basics running smoothly.
 
So now let's build these...
 
This walk-through picks up on the prior post, veering off at code input step 5.*


We'll use the probabilistic Naive Bayes method, which is well suited to this type of data, via sklearn's MultinomialNB predictor, as well as two variations of the linear SVM from sklearn: SGDClassifier and PassiveAggressiveClassifier.
In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
models = []
models.append(MultinomialNB(alpha=.01)) 
models.append(SGDClassifier(alpha=.0001, loss='modified_huber',n_iter=50,n_jobs=-1,random_state=42,penalty='l2'))
models.append(PassiveAggressiveClassifier(C=1,n_iter=50,n_jobs=-1,random_state=42)) 
 
Save off the built models for later use, as well as some stats and metrics about the performance:
In [6]:
from sklearn.externals import joblib
from sklearn import metrics
results = {}
for model in models:
    model_name = type(model).__name__
    model.fit(X_train, Y_train)
    joblib.dump(model, "models/topics."+model_name+".pkl")
    pred = model.predict(X_test)
    class_f1 = metrics.f1_score(Y_test, pred, average=None)
    model_f1 = metrics.f1_score(Y_test, pred)
    model_acc = metrics.accuracy_score(Y_test, pred)
    probs = None
    norm = False
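    try:
        # probabilistic models expose class probabilities directly
        probs = model.predict_proba(X_test)
    except Exception:
        pass
    if probs is None:
        try:
            # otherwise fall back to signed distances, which need normalizing later
            probs = model.decision_function(X_test)
            norm = True
        except Exception:
            print "Unable to extract probabilities"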
    results[model_name] = { "pred": pred, 
                            "probs": probs, 
                            "norm": norm, 
                            "class_f1": class_f1, 
                            "f1": model_f1, 
                            "accuracy": model_acc }
results    
Out[6]:
{'MultinomialNB': {'accuracy': 0.90317052270779774,
  'class_f1': array([ 0.91176471,  0.78378378,  0.92307692,  1.        ,  0.73226545,
        0.98550725,  1.        ,  0.92727273,  0.83783784,  0.66666667,
        0.92307692,  0.96551724,  0.96969697,  0.7826087 ,  0.81818182,
        0.94117647,  0.80952381,  0.95652174,  0.90740741,  0.72727273,
        0.8       ,  0.90909091,  0.88888889,  0.85714286,  0.82848837,
        0.84615385,  0.91542289,  1.        ,  0.93167702,  0.9375    ,
        1.        ,  0.95238095,  0.90909091,  0.85714286,  1.        ,
        0.9       ,  0.90909091,  0.77777778,  0.9       ,  0.89189189,
        0.92134831,  0.9       ,  0.99386503,  0.96296296,  1.        ,
        0.81481481,  1.        ,  1.        ,  1.        ,  0.9726776 ,
        0.88888889,  0.81481481,  1.        ,  0.93506494,  0.66666667,
        1.        ,  0.9       ,  0.66666667,  0.77727952,  0.92307692,
        0.88461538,  0.86956522,  0.88479263,  0.93023256,  0.95302013,
        0.88888889,  0.91428571,  1.        ,  0.78571429,  0.81818182,
        0.95238095,  0.95238095,  1.        ,  0.8       ,  0.92134831,
        0.75      ,  0.9       ,  0.95686275,  0.84210526,  0.92307692,
        0.92982456,  0.90909091,  0.975     ,  0.98630137,  0.83333333,
        1.        ,  1.        ,  0.99043977,  0.76923077,  0.76923077,
        0.72      ,  0.66666667,  1.        ,  0.92561983,  0.76923077,
        0.86842105,  0.6728972 ,  1.        ,  1.        ,  0.9819376 ,
        0.92030848]),
  'f1': 0.90425650714576233,
  'norm': False,
  'pred': array([ 80.,  34.,  40., ...,  80.,  80.,  99.]),
  'probs': array([[  2.40973619e-073,   4.61021751e-072,   1.49431625e-083, ...,
          7.06802900e-070,   7.87538237e-085,   3.95436935e-092],
       [  3.84477547e-009,   2.73354066e-009,   5.04147141e-009, ...,
          5.64165086e-009,   4.27726628e-010,   6.96643800e-010],
       [  1.35237976e-033,   1.55981122e-026,   5.27750168e-031, ...,
          1.38907003e-031,   1.91871068e-040,   7.90134376e-023],
       ..., 
       [  7.00464283e-021,   1.42085525e-019,   1.08346607e-027, ...,
          8.18649153e-026,   7.52087143e-027,   4.42046005e-023],
       [  1.04982806e-049,   1.33612621e-047,   5.33165094e-045, ...,
          8.20836489e-049,   2.76637762e-060,   1.15624776e-058],
       [  4.77707280e-230,   6.78060805e-227,   2.04462450e-214, ...,
          4.03932473e-212,   1.00000000e+000,   9.62285611e-258]])},
 'PassiveAggressiveClassifier': {'accuracy': 0.91533847472150809,
  'class_f1': array([ 0.86956522,  0.85714286,  1.        ,  1.        ,  0.6450116 ,
        0.98550725,  1.        ,  0.92592593,  0.90909091,  0.90909091,
        1.        ,  0.96551724,  0.96969697,  0.81818182,  0.9       ,
        0.875     ,  0.97142857,  1.        ,  0.92982456,  0.88888889,
        0.85714286,  0.88888889,  1.        ,  1.        ,  0.86111111,
        0.88      ,  0.97354497,  1.        ,  0.92215569,  0.96774194,
        1.        ,  1.        ,  0.76923077,  0.86075949,  0.98876404,
        0.97560976,  1.        ,  0.94117647,  0.94736842,  0.92715232,
        0.92045455,  0.78947368,  0.98159509,  0.94871795,  1.        ,
        0.775     ,  1.        ,  1.        ,  1.        ,  0.97777778,
        0.88      ,  0.88      ,  1.        ,  0.86567164,  1.        ,
        1.        ,  0.84210526,  0.66666667,  0.85557084,  0.93333333,
        0.89361702,  0.91304348,  0.91402715,  0.95348837,  0.94520548,
        0.91304348,  0.8       ,  1.        ,  0.89655172,  0.76923077,
        0.93023256,  0.95081967,  1.        ,  1.        ,  0.93478261,
        0.8       ,  0.95238095,  0.99230769,  0.92307692,  0.96296296,
        0.92156863,  0.90909091,  0.96202532,  1.        ,  0.86746988,
        0.97560976,  0.97297297,  0.98846154,  0.75555556,  0.83333333,
        0.9       ,  0.8       ,  1.        ,  0.94736842,  0.9375    ,
        0.9382716 ,  0.7804878 ,  1.        ,  1.        ,  0.98846787,
        0.95187166]),
  'f1': 0.91627947823867106,
  'norm': True,
  'pred': array([ 80.,  34.,  24., ...,   4.,  80.,  99.]),
  'probs': array([[-6.50553156, -7.33053267, -5.86251708, ..., -6.3615013 ,
        -4.61827337, -4.99723938],
       [-1.42865833, -1.30183262, -1.276344  , ..., -1.28003285,
        -1.32275774, -1.53379058],
       [-1.86725075, -2.50969362, -1.77919327, ..., -1.79010457,
        -1.74044571, -1.64161697],
       ..., 
       [-2.10958176, -1.57150069, -2.75524001, ..., -3.05243249,
        -2.32612049, -2.33553294],
       [-2.85613606, -5.13216583, -3.12312947, ..., -3.39208034,
        -3.37134486, -2.82044267],
       [-4.74802868, -8.87368108, -5.07165789, ..., -4.46875551,
         4.16080753, -6.87427149]])},
 'SGDClassifier': {'accuracy': 0.91088260497000861,
  'class_f1': array([ 0.92307692,  0.79411765,  1.        ,  1.        ,  0.67613636,
        0.97142857,  1.        ,  0.93883792,  0.94117647,  0.71428571,
        0.97142857,  0.96551724,  0.96969697,  0.80952381,  0.94736842,
        0.82352941,  0.94444444,  0.97777778,  0.91076923,  0.88888889,
        0.94117647,  0.93333333,  1.        ,  0.85714286,  0.84709066,
        0.84615385,  0.94972067,  1.        ,  0.93251534,  0.85714286,
        1.        ,  1.        ,  1.        ,  0.85333333,  1.        ,
        0.95238095,  1.        ,  0.94117647,  0.94736842,  0.91275168,
        0.93491124,  0.81081081,  0.98765432,  0.94871795,  1.        ,
        0.79487179,  0.96774194,  1.        ,  1.        ,  0.97206704,
        0.83333333,  0.84615385,  0.95652174,  0.88888889,  1.        ,
        1.        ,  0.83333333,  0.66666667,  0.82099596,  0.91479821,
        0.83333333,  0.88372093,  0.92727273,  0.94382022,  0.97183099,
        0.93023256,  0.875     ,  0.95652174,  0.93333333,  0.86956522,
        0.95238095,  0.95081967,  0.8       ,  1.        ,  0.87058824,
        0.8       ,  0.95238095,  0.9765625 ,  0.86486486,  0.96296296,
        0.91327148,  0.66666667,  0.98734177,  0.95652174,  0.89156627,
        0.97560976,  0.96363636,  0.97475728,  0.80952381,  0.8       ,
        0.95238095,  0.75      ,  1.        ,  0.96491228,  0.9375    ,
        0.95121951,  0.72941176,  1.        ,  1.        ,  0.98839138,
        0.94933333]),
  'f1': 0.91022607629402619,
  'norm': True,
  'pred': array([ 80.,  34.,  40., ...,  80.,  80.,  99.]),
  'probs': array([[ -2.26147719,  -8.27192811,  -4.92650663, ...,  -6.28101298,
         -4.394243  ,  -5.40679641],
       [ -5.28684017,  -6.85913339,  -6.07070456, ...,  -6.38595219,
         -4.88673419,  -8.5894273 ],
       [ -5.44502194,  -7.45632928,  -6.12684297, ...,  -6.38543615,
         -5.01141554,  -8.79847179],
       ..., 
       [ -3.62707077,  -5.61320527,  -5.65215132, ...,  -6.35567368,
         -5.028778  ,  -8.08235843],
       [ -5.02315463,  -7.26758048,  -5.8154445 , ...,  -6.52950791,
         -5.16125362,  -9.12372876],
       [ -6.22352808,  -7.47187108,  -6.1562682 , ...,  -6.45018436,
          6.56404494, -11.14563358]])}}
 
Let's define the scoring functions: compute_confidence performs the first portion to convert distances to a measure of confidence (or use probabilities if available), and then format them. normalize_distances does the conversion we need for SVM. predict does the prediction and computes the confidence.
In [7]:
import numpy as np  # harmless if numpy is already loaded from the prior post

def normalize_distances(probs):
    # get the probabilities of a positive and negative distance
    pPos = float(len(np.where(np.array(probs) > 0)[0])) / float(len(probs))
    pNeg = float(len(np.where(np.array(probs) < 0)[0])) / float(len(probs))
    if pNeg > 0:
        probs[np.where(probs > 0)] *= pNeg
    if pPos > 0:
        probs[np.where(probs < 0)] *= pPos
    # subtract the mean and divide by standard deviation if non-zero
    probs_std = np.std(probs, ddof=1)
    if probs_std != 0:
        probs = (probs - np.mean(probs)) / probs_std
    # divide by the range if non-zero
    pDiff = np.abs(np.max(probs) - np.min(probs))
    if pDiff != 0:
        probs /= pDiff
    return probs

def compute_confidence(probs, labels, norm=False):
    confidences = None
    class_confidence = []
    if norm:
        confidences = normalize_distances(probs)
    else:
        confidences = probs
    for c in xrange(len(confidences)):
        confidence = round(confidences[c], 2)
        if confidence != 0:
            class_confidence.append({ "label": labels[c], "confidence": confidence })
    return sorted(class_confidence, key=lambda k: k["confidence"], reverse=True)

def predict(txt, model, model_name, vectorizer, labels):
    predObj = {}
    X_data = vectorizer.transform([txt])
    pred = model.predict(X_data)
    label_id = int(pred[0])
    predObj["model"] = model_name
    predObj["label_name"] = labels[label_id]
    predObj["f1_score"] = float(results[model_name]["class_f1"][label_id])
    if results[model_name]["norm"]:
        probs = model.decision_function(X_data)[0]
    else:
        probs = model.predict_proba(X_data)[0]
    predObj["confidence"] = compute_confidence(probs, labels, norm=results[model_name]["norm"])
    return predObj
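To see what the normalization does to a row of SVM distances, here is a plain-Python rendering of the same three steps that returns a new list rather than modifying its input. The name `normalize_distances_plain` and the sample distances are made up for illustration, and it assumes at least two distances are passed in.

```python
import math

def normalize_distances_plain(dists):
    """Same three steps as normalize_distances, without numpy."""
    n = float(len(dists))
    p_pos = sum(1 for d in dists if d > 0) / n
    p_neg = sum(1 for d in dists if d < 0) / n
    # scale each side by the share of distances on the other side
    scaled = []
    for d in dists:
        if d > 0 and p_neg > 0:
            d *= p_neg
        elif d < 0 and p_pos > 0:
            d *= p_pos
        scaled.append(d)
    # subtract the mean and divide by the sample standard deviation (ddof=1)
    mean = sum(scaled) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in scaled) / (n - 1))
    if std != 0:
        scaled = [(d - mean) / std for d in scaled]
    # divide by the range, so that max - min == 1
    spread = max(scaled) - min(scaled)
    if spread != 0:
        scaled = [d / spread for d in scaled]
    return scaled

confidences = normalize_distances_plain([-4.0, -2.0, 4.0])
print([round(c, 2) for c in confidences])
```

The result is centered on zero and spans a range of exactly 1, so these "confidences" are relative scores rather than true probabilities, and labels below the mean come out negative.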
    
    
        
 
To do ensemble prediction, first we run the predict method for each model:
In [8]:
prediction_results = []
label_scores = None
model_names = []
txt = "This is a tough mud run. Tough, as in, this could be one of the hardest events you, as a runner have ever attempted. This is not a walk in the park. This is not your average neighborhood 5k. This is more fun than a marathon. This is a challenge. The obstacles were designed by military and fitness experts and will test you to the max. Push personal limits while running, crawling, climbing, jumping, dragging, and other surprise tasks that test endurance and strength. The race consists of non-competitive heats as well as a free kid's course for children age 5-13. Each heat will depart in a specific wave so the course doesn't get overcrowded."
for model in models:
    model_name = type(model).__name__
    try:
        prediction_results.append( predict(txt,
                                    model,model_name, vectorizer, unique_topics))
        model_names.append(model_name)
    except Exception as e:
        print "Error predicting: " + str(e)
prediction_results
Out[8]:
[{'confidence': [{'confidence': 1.0, 'label': u'Mud running'}],
  'f1_score': 0.6666666666666665,
  'label_name': u'Mud running',
  'model': 'MultinomialNB'},
 {'confidence': [{'confidence': 0.94, 'label': u'Distance running'},
   {'confidence': 0.14, 'label': u'Mud running'},
   {'confidence': 0.01, 'label': u'Bassoon'},
   {'confidence': 0.01, 'label': u'Dance'},
   {'confidence': 0.01, 'label': u'Judo'},
   {'confidence': 0.01, 'label': u'Sexual health'},
   {'confidence': 0.01, 'label': u'Strength training'},
   {'confidence': 0.01, 'label': u'Tuba'},
   {'confidence': -0.01, 'label': u'Acting'},
   {'confidence': -0.01, 'label': u'Aikido'},
   {'confidence': -0.01, 'label': u'Bowling'},
   {'confidence': -0.01, 'label': u'Chess'},
   {'confidence': -0.01, 'label': u'Child care'},
   {'confidence': -0.01, 'label': u'Chinese'},
   {'confidence': -0.01, 'label': u'Cross country skiing'},
   {'confidence': -0.01, 'label': u'Digital photography'},
   {'confidence': -0.01, 'label': u'Diving'},
   {'confidence': -0.01, 'label': u'Drawing and drafting'},
   {'confidence': -0.01, 'label': u'Driving'},
   {'confidence': -0.01, 'label': u'Fencing'},
   {'confidence': -0.01, 'label': u'Flag football'},
   {'confidence': -0.01, 'label': u'Ice hockey'},
   {'confidence': -0.01, 'label': u'Italian'},
   {'confidence': -0.01, 'label': u'Kayaking'},
   {'confidence': -0.01, 'label': u'Knitting'},
   {'confidence': -0.01, 'label': u'Painting'},
   {'confidence': -0.01, 'label': u'Percussion and Drumming'},
   {'confidence': -0.01, 'label': u'Pilates'},
   {'confidence': -0.01, 'label': u'Reading and writing'},
   {'confidence': -0.01, 'label': u'Sculpture'},
   {'confidence': -0.01, 'label': u'Self-defence'},
   {'confidence': -0.01, 'label': u'Skateboarding'},
   {'confidence': -0.01, 'label': u'Table tennis'},
   {'confidence': -0.01, 'label': u'Taxes'},
   {'confidence': -0.01, 'label': u'Tending animals'},
   {'confidence': -0.01, 'label': u'Theater'},
   {'confidence': -0.01, 'label': u'Violin'},
   {'confidence': -0.01, 'label': u'Wrestling'},
   {'confidence': -0.02, 'label': u'Aerobics'},
   {'confidence': -0.02, 'label': u'Ballroom dance'},
   {'confidence': -0.02, 'label': u'CPR'},
   {'confidence': -0.02, 'label': u'Cello'},
   {'confidence': -0.02, 'label': u'Cross country running'},
   {'confidence': -0.02, 'label': u'First aid and CPR'},
   {'confidence': -0.02, 'label': u'Improv'},
   {'confidence': -0.02, 'label': u'Mountain biking'},
   {'confidence': -0.02, 'label': u'Photography'},
   {'confidence': -0.02, 'label': u'Pottery and ceramics'},
   {'confidence': -0.02, 'label': u'Snowboarding'},
   {'confidence': -0.02, 'label': u'Spanish'},
   {'confidence': -0.02, 'label': u'Tae Kwon Do'},
   {'confidence': -0.02, 'label': u'Tap dance'},
   {'confidence': -0.02, 'label': u'Tennis'},
   {'confidence': -0.02, 'label': u'Tumbling'},
   {'confidence': -0.03, 'label': u'Aquatic sports'},
   {'confidence': -0.03, 'label': u'Creative writing'},
   {'confidence': -0.03, 'label': u'Flute'},
   {'confidence': -0.03, 'label': u'Ice skating'},
   {'confidence': -0.03, 'label': u'Jewelry making'},
   {'confidence': -0.03, 'label': u'Karate'},
   {'confidence': -0.03, 'label': u'Skiing'},
   {'confidence': -0.03, 'label': u'Voice and singing'},
   {'confidence': -0.03, 'label': u'Zumba'},
   {'confidence': -0.04, 'label': u'Ballet'},
   {'confidence': -0.04, 'label': u'Hip Hop dance'},
   {'confidence': -0.04, 'label': u'Jazz dance'},
   {'confidence': -0.04, 'label': u'Lifeguarding'},
   {'confidence': -0.05, 'label': u'Guitar'},
   {'confidence': -0.06, 'label': u'Piano'}],
  'f1_score': 0.9497206703910615,
  'label_name': u'Distance running',
  'model': 'SGDClassifier'},
 {'confidence': [{'confidence': 0.6, 'label': u'Mud running'},
   {'confidence': 0.43, 'label': u'Distance running'},
   {'confidence': 0.25, 'label': u'Strength training'},
   {'confidence': 0.24, 'label': u'Chess'},
   {'confidence': 0.21, 'label': u'Cross country running'},
   {'confidence': 0.2, 'label': u'Lifeguarding'},
   {'confidence': 0.17, 'label': u'Sailing'},
   {'confidence': 0.17, 'label': u'Trail running'},
   {'confidence': 0.16, 'label': u'Mountain biking'},
   {'confidence': 0.16, 'label': u'Skateboarding'},
   {'confidence': 0.15, 'label': u'Zumba'},
   {'confidence': 0.14, 'label': u'Acting'},
   {'confidence': 0.14, 'label': u'Yoga'},
   {'confidence': 0.12, 'label': u'Badminton'},
   {'confidence': 0.11, 'label': u'Snowboarding'},
   {'confidence': 0.11, 'label': u'Surfing'},
   {'confidence': 0.1, 'label': u'Cake decorating'},
   {'confidence': 0.1, 'label': u'Karate'},
   {'confidence': 0.09, 'label': u'Archery'},
   {'confidence': 0.09, 'label': u'Drawing and drafting'},
   {'confidence': 0.09, 'label': u'Judo'},
   {'confidence': 0.09, 'label': u'Wood work'},
   {'confidence': 0.08, 'label': u'Kayaking'},
   {'confidence': 0.08, 'label': u'Piano'},
   {'confidence': 0.07, 'label': u'Aikido'},
   {'confidence': 0.07, 'label': u'Snowshoeing'},
   {'confidence': 0.07, 'label': u'Tai chi'},
   {'confidence': 0.06, 'label': u'Diving'},
   {'confidence': 0.06, 'label': u'Ice skating'},
   {'confidence': 0.06, 'label': u'Quilting'},
   {'confidence': 0.06, 'label': u'Tae Kwon Do'},
   {'confidence': 0.05, 'label': u'Chinese'},
   {'confidence': 0.04, 'label': u'Flute'},
   {'confidence': 0.04, 'label': u'French'},
   {'confidence': 0.04, 'label': u'Kickboxing'},
   {'confidence': 0.04, 'label': u'Percussion and Drumming'},
   {'confidence': 0.04, 'label': u'Pilates'},
   {'confidence': 0.04, 'label': u'Wrestling'},
   {'confidence': 0.03, 'label': u'Bridge'},
   {'confidence': 0.03, 'label': u'Driving'},
   {'confidence': 0.03, 'label': u'Fencing'},
   {'confidence': 0.03, 'label': u'Guitar'},
   {'confidence': 0.03, 'label': u'Tap dance'},
   {'confidence': 0.02, 'label': u'Cello'},
   {'confidence': 0.02, 'label': u'Digital photography'},
   {'confidence': 0.02, 'label': u'Saxophone'},
   {'confidence': 0.02, 'label': u'Tumbling'},
   {'confidence': 0.01, 'label': u'American football'},
   {'confidence': 0.01, 'label': u'Bowling'},
   {'confidence': 0.01, 'label': u'French horn'},
   {'confidence': 0.01, 'label': u'Improv'},
   {'confidence': 0.01, 'label': u'Jazz dance'},
   {'confidence': 0.01, 'label': u'Sexual health'},
   {'confidence': 0.01, 'label': u'Taxes'},
   {'confidence': -0.01, 'label': u'First aid and CPR'},
   {'confidence': -0.01, 'label': u'Tennis'},
   {'confidence': -0.01, 'label': u'Trombone'},
   {'confidence': -0.01, 'label': u'Trumpet'},
   {'confidence': -0.02, 'label': u'Ballroom dance'},
   {'confidence': -0.02, 'label': u'Cross country skiing'},
   {'confidence': -0.02, 'label': u'Field hockey'},
   {'confidence': -0.02, 'label': u'Magic'},
   {'confidence': -0.02, 'label': u'Pottery and ceramics'},
   {'confidence': -0.02, 'label': u'Viola'},
   {'confidence': -0.02, 'label': u'Violin'},
   {'confidence': -0.03, 'label': u'Self-defence'},
   {'confidence': -0.04, 'label': u'Photography'},
   {'confidence': -0.05, 'label': u'Ballet'},
   {'confidence': -0.05, 'label': u'Italian'},
   {'confidence': -0.06, 'label': u'Boxing'},
   {'confidence': -0.08, 'label': u'Gardening'},
   {'confidence': -0.08, 'label': u'Hip Hop dance'},
   {'confidence': -0.09, 'label': u'Spanish'},
   {'confidence': -0.1, 'label': u'Table tennis'},
   {'confidence': -0.1, 'label': u'Theater'},
   {'confidence': -0.11, 'label': u'Jewelry making'},
   {'confidence': -0.14, 'label': u'Reading and writing'},
   {'confidence': -0.15, 'label': u'Creative writing'},
   {'confidence': -0.15, 'label': u'Ice hockey'},
   {'confidence': -0.15, 'label': u'Voice and singing'},
   {'confidence': -0.16, 'label': u'Figure skating'},
   {'confidence': -0.17, 'label': u'CPR'},
   {'confidence': -0.18, 'label': u'Painting'},
   {'confidence': -0.19, 'label': u'Tending animals'},
   {'confidence': -0.25, 'label': u'Flag football'},
   {'confidence': -0.27, 'label': u'Dance'},
   {'confidence': -0.27, 'label': u'Skiing'},
   {'confidence': -0.28, 'label': u'Aquatic sports'},
   {'confidence': -0.32, 'label': u'Child care'},
   {'confidence': -0.36, 'label': u'Aerobics'},
   {'confidence': -0.36, 'label': u'Swimming'},
   {'confidence': -0.37, 'label': u'Music'},
   {'confidence': -0.4, 'label': u'Sculpture'}],
  'f1_score': 0.6666666666666666,
  'label_name': u'Mud running',
  'model': 'PassiveAggressiveClassifier'}]
 
Finally, we do the weighted F1 adjustment:
In [9]:
from operator import itemgetter
conf_labels = {}
scored_labels = {}
for prediction_result in prediction_results:
    if "label_name" in prediction_result:
        l_name = prediction_result["label_name"]
        f1_score = prediction_result["f1_score"]
        conf = [ float(c["confidence"]) for c in prediction_result["confidence"] if c["label"] == prediction_result["label_name"] ]
        if len(conf):
            conf = conf[0]
        else:
            conf = 0
        # update with confidence weighted by f1
        #variations are to *= f1_score, or update confidence with just the confidence
        if l_name in scored_labels.keys():
            conf_labels[l_name] += f1_score * conf
            scored_labels[l_name] += f1_score
        else:
            conf_labels[l_name] = f1_score * conf
            scored_labels[l_name] = f1_score
labels_sorted = sorted(scored_labels.iteritems(),key=itemgetter(1),reverse=True)
labels_pred = []
# compute ensemble averages
num_models = len(models)
for i in xrange(len(labels_sorted)):
    label = labels_sorted[i][0]
    label_avg_f1score = float("%0.3f" % (float(labels_sorted[i][1]) / num_models))
    label_avg_conf = float("%0.3f" % (float(conf_labels[label]) / num_models))
    label_score = float("%0.3f" % (label_avg_f1score * label_avg_conf))
    labels_pred.append({ "label": label, "score": label_score })
labels_pred = sorted(labels_pred, key=lambda k: k["score"], reverse=True)
labels_pred   
Out[9]:
[{'label': u'Mud running', 'score': 0.158},
 {'label': u'Distance running', 'score': 0.094}]
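As a sanity check, the two scores can be reproduced by hand from the F1 scores and top-label confidences in Out[8]. This snippet just replays the arithmetic of the cell above on those numbers; the `votes` structure and the `r3` helper are made up for the example.

```python
def r3(value):
    """Round to three decimals the same way the cell above does."""
    return float("%0.3f" % value)

# (f1_score, confidence in the predicted label) per model, copied from Out[8]
votes = {
    "Mud running": [(0.6666666666666665, 1.0), (0.6666666666666666, 0.6)],
    "Distance running": [(0.9497206703910615, 0.94)],
}
num_models = 3

scores = {}
for label, pairs in votes.items():
    avg_f1 = r3(sum(f1 for f1, _ in pairs) / num_models)
    avg_conf = r3(sum(f1 * conf for f1, conf in pairs) / num_models)
    scores[label] = r3(avg_f1 * avg_conf)
```

"Mud running" works out to 0.444 * 0.356 = 0.158 and "Distance running" to 0.317 * 0.298 = 0.094. Note that both averages divide by all three models, so a label backed by two models with a modest F1 can outscore a label backed by a single model with a high F1.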
 
So, there's a quick implementation of an ensemble method combining linear and probabilistic machine learning for text document classification. Next we'll go more into the web service and see how we can build models that perform their prediction faster, and how we can serve up those predictions fast enough for an enterprise-level Prediction API.
 
References
  1. Wikipedia does pretty well with its Naive Bayes page and the references therein. scikit-learn has a good description, as well.
  2. The OpenCV SVM intro page is pretty clear. scikit-learn explains SVMs, too.
  3. Wikipedia is concise and clear on SGD. scikit-learn's page is here.
  4. The canonical reference for online PA.
*Code here runs on Python 2.7.3 64-bit, with numpy 1.6.1, scipy 0.12.0, and sklearn 0.13.1. Ran/formatted with IPython 1.1.0 notebook.