{
 "metadata": {
  "name": "",
  "signature": "sha256:3ff8f42cd731669a23da223a73269b2673561f2a0d2c50d8b6a6e604e6588276"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Soft margins are not as good as hard margins.\n",
      "\n",
      "John Mount\n",
      "[Win-Vector LLC](http://www.win-vector.com/)\n",
      "1-18-2015\n",
      "\n",
      "Consider a soft-margin support vector machine as defined in:\n",
      "   * [Cortes and Vapnik, 1995] Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3) \n",
      "   * [Vapnik, 1998] Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.\n",
      "\n",
      "Soft margin results are derived from hard margin results in\n",
      "corollary on page 408 of section 10.2.1 of [Vapnik1998].\n",
      "The content is essentially that a support vector machine with a given\n",
      "margin can be expected to have an error rate on new data (identically\n",
      "distributed as the training data) that is no more than the error\n",
      "$\\Delta$-margin error rate seen on training data plus a factor that is\n",
      "roughly $VCdimension/n$ where $n$ is the number of training examples.\n",
      "From the definitions of section 10.2.1 it appears the $\\Delta$-margin\n",
      "error rate can be taken to be all errors made on training data\n",
      "including all data that falls too close to the decision surface as\n",
      "also being in error (this appears to be implicit in the definition of\n",
      "a $\\Delta$-margin separating hyperplane).\n",
      "\n",
      "Let's explore a bit how to use such a bound.  For our problem set up the very simple artificial one dimensional problem with $x_i$ being the effective variable and $y_i$ being the outcome.  Our training data is $y_i = +1$ for $x_i$ uniformly distributed in $(0,1]$ and $y_i = -1$ for $x_i$ uniformly distributed in $[-1,0)$.  We imagine we draw $n$ samples independently from these distributions (picking from the + and - classes with equal probability).\n",
      "\n",
      "We are going to make the problem simple and assume we are only estimating a single scalar $w$ and our model is $sign(w \\cdot x)$.  In this case we see for any $w > 0$ we have a perfect model (i.e. see no errors).\n",
      "\n",
      "Let's look how the standard soft-margin SVM procedure would pick a $w$.  The standard soft-margin SVM optimization problem for this set of concepts (left/right division of the number line at the origin) is (see [Vapnik1998] section 10.2.2):\n",
      "\n",
      "\\begin{align*}   \n",
      "   \\text{minimize}:  & \\\\ \n",
      "   &  w \\cdot w/2 + C \\sum_{i=1}^{n} \\xi_i \\\\\n",
      "   \\text{subject to:} & \\\\\n",
      "    & \\xi_i >= 0 \\\\\n",
      "    & y_i (x_i \\cdot w) >= 1 - \\xi_i \n",
      "\\end{align*}   \n",
      "\n",
      "Where $C$ is a parameter usually picked before looking at the data, and often picked as $C=1$.\n",
      "\n",
      "The stated conditions insist $|x_i| w >= 1 - \\xi_i$ and therefore $\\xi_i >= 1 - |x_i| w$.  We then expect the fraction of our data that needs a $\\xi_i > 0$ corrections to be $min(1,1/w)$ and the total expected mass of corrections to  be $C n \\int_{x=0}^{min(1,1/w)} (1-x w) dx$ (we get one copy of this integeral for each side of the margin, and the total set of examples has range 2, which gets divided out).\n",
      "\n",
      "From the theory of SVMs we can bound the error seen on new data to be no more than $\\Delta$-margin error rate seen during training plus a $(1+1/margin^2)/n$ bound on excess generalization error.  Our margin is of radius $1/w$ (measured from the center).  So we expect our error bound to be no more than the empirical error rate of an appropriate hard-margin classifier that is considered wrong in the margin/moat area (or about a $1/w$ fraction of the time) plus the excess error bound ($(1+1/margin^2)/n = (1+w^2)/n$).  So our total error-bound is $1/w+(1+w^2)/n$.  Note we are assuming that the dimension of the concept space the support vector machine is working in is not obviously known (so we can't apply the obvious bound $VCdimension \\le 1$, and instead have to rely on the margin derived portion of the bound).\n",
      "\n",
      "Let's see what value of $C$ will minimize this error bound estimate.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Python ipython\n",
      "import sympy\n",
      "x,w,C,n = sympy.symbols(['x','w','C','n'])\n",
      "\n",
      "NormTerm = w*w/2\n",
      "print('NormTerm',NormTerm)\n",
      "EMarginTerm = C*n*sympy.integrate(1-x*w,(x,0,1/w))\n",
      "print('EMarginTerm',EMarginTerm)\n",
      "Objective = NormTerm + EMarginTerm\n",
      "print('Objective',Objective)\n",
      "Derivative = sympy.diff(Objective,w)\n",
      "print('Derivative',Derivative)\n",
      "OptimalWsGivenC = sympy.solve(Derivative,w)\n",
      "print('OptimalWsGivenC',OptimalWsGivenC)\n",
      "BestWGivenC = [ si for si in OptimalWsGivenC if sympy.N(si.subs([(C,1),(n,1)]))][0]\n",
      "print('BestWGivenC',BestWGivenC)\n",
      "ErrorBound = 1/w + (1 + w**2)/n\n",
      "FnOfC = sympy.simplify(ErrorBound.subs(w,BestWGivenC))\n",
      "print('FnOfC',FnOfC)\n",
      "D2 = sympy.simplify(sympy.diff(FnOfC,C))\n",
      "print('D2',D2)\n",
      "CPicks = sympy.solve(D2,C)\n",
      "print('CPicks',CPicks)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "NormTerm w**2/2\n",
        "EMarginTerm C*n/(2*w)\n",
        "Objective C*n/(2*w) + w**2/2\n",
        "Derivative -C*n/(2*w**2) + w\n",
        "OptimalWsGivenC"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " [2**(2/3)*(C*n)**(1/3)/2, -2**(2/3)*(C*n)**(1/3)/4 - 2**(2/3)*sqrt(3)*I*(C*n)**(1/3)/4, -2**(2/3)*(C*n)**(1/3)/4 + 2**(2/3)*sqrt(3)*I*(C*n)**(1/3)/4]\n",
        "BestWGivenC 2**(2/3)*(C*n)**(1/3)/2\n",
        "FnOfC"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " 2**(1/3)*C/(2*(C*n)**(1/3)) + 2**(1/3)/(C*n)**(1/3) + 1/n\n",
        "D2"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " 2**(1/3)*(C - 1)/(3*C*(C*n)**(1/3))\n",
        "CPicks [1]\n"
       ]
      }
     ],
     "prompt_number": 1
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "What we did is look for minima of the optimization objectives (the SVM optimization problem and then the error-bound) by inspecting extreme of these functions by looking at zeros of derivatives.  This let us confirm the common choice $C=1$ is in fact an optimal pick for this toy problem.\n",
      "\n",
      "Let's substitute $C=1$ into our expressions and see what the SVM optimization procedure chooses for $w$, and what generalization bound this gives us."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "CPick = CPicks[0]\n",
      "wPick = BestWGivenC.subs(C,CPick)\n",
      "print('wPick',wPick)\n",
      "print('ErrorBoundw',sympy.simplify(ErrorBound.subs(w,wPick)))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "wPick 2**(2/3)*n**(1/3)/2\n",
        "ErrorBoundw"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " 1/n + 3*2**(1/3)/(2*n**(1/3))\n"
       ]
      }
     ],
     "prompt_number": 2
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We see with $C=1$ we get $w=2^{2/3} n^{1/3}/2$, which is growing as the number of training examples $n$ increases (but not shrinking very fast).  Our error bound is then $1/n + 3 \\times 2^{1/3}/(2 n^{1/3})$, which is slowly shrinking as $n$ grows.  \n",
      "\n",
      "In fact the error bound is shrinking much slower than one might initially expect.  The issue is our data doesn't have any intrinsic large margin (a common problem) for real data, but we are having to build our own margin using the soft-margin method.\n",
      "\n",
      "For a constant margin VC dimension theory tells us excess generalization error shrinks linearly in $n$, but picking the optimal margin (i.e. shrinking the margin as we get more data) only gives us an overall error-rate bound that is shrinking as $1/n^{1/3}$.  Note this is only the bound, as the actual error-rate is zero for any $w >0$ for this toy problem.\n",
      "\n",
      "For the fixed $C=1$ the optimization algorithm is (correctly) picking smaller margins as more data becomes available.  This shrinking margin unfortunately obscures some of the sample-size driven improvement in generalization error (the excess generalization error or VC dimension part of the bound).  But the trade-off also allows the training algorithm to consider a larger fraction of the training data as being out of the moat (or not in the margin around the decision surface).  This is correct behavior on the part of the optimization procedure, and is shrinking the error bound as fast as possible.  But \"fast\" is much slower than any (incorrect) intuition that a fixed $C$ implies a fixed margin as the amount of training data increases.  Also note, it is not common to actually explicitly calculate the implied error bound when using SVMs (we just used the obvious structure of our toy problem to allow such a calculation).  And further note, we these bounds are approximate as we used simplified forms and not the full equations from the reference (though we are confident we get the overall behavior correct).\n",
      "\n",
      "This means we expect a halving of the error-bound only each time we multiply our available data by a factor of eight.  This is slower than any rule of thumb that states the error-bound shrinks linearly with the amount of training data.  Just keep this in mind when deciding how much data you may need for a good SVM result.\n"
     ]
    },
    {
     "cell_type": "heading",
     "level": 1,
     "metadata": {},
     "source": [
      "----"
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "We could try the same analysis for two normal densitities centered at -1 and +1.  In this case the expected correction mass integral becomes $C n \\int_{x=-min(1,1/w)}^{min(1,1/w)} (1-x w) \\frac{1}{\\sigma \\sqrt{2 \\pi}} \\text{exp}(\\frac{-(1-x)^2}{2 \\sigma^2})dx$"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Python ipython\n",
      "import sympy\n",
      "import math\n",
      "x,w,n = sympy.symbols(['x','w','n'])\n",
      "\n",
      "C = 1.0\n",
      "sigma = 0.5\n",
      "\n",
      "NormTerm = w*w/2\n",
      "print('NormTerm',NormTerm)\n",
      "EMarginTerm = -C*n*sympy.integrate((1-(1-x)*w)*sympy.exp(-x*x/(2*sigma*sigma))/(sigma*math.sqrt(2*math.pi)),(x,1-1/w,1+1/w))\n",
      "print('EMarginTerm',EMarginTerm)\n",
      "Objective = NormTerm + EMarginTerm\n",
      "print('Objective',Objective)\n",
      "Derivative = sympy.diff(Objective,w)\n",
      "print('Derivative',Derivative)\n",
      "for ni in range(1,5):\n",
      "    di = Derivative.subs(n,ni)\n",
      "    print('\\tdi',ni,di)\n",
      "    si = sympy.solvers.nsolve(di,1/(1.0*ni))\n",
      "    print(si)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "NormTerm w**2/2\n",
        "EMarginTerm"
       ]
      },
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        " -1.0*n*(0.199471140200716*w*exp(-2.0*(1 - 1/w)**2) - 0.199471140200716*w*exp(-2.0*(1 + 1/w)**2) + 0.282094791773878*sqrt(pi)*w*erf(1.4142135623731 - 1.4142135623731/w) - 0.282094791773878*sqrt(pi)*w*erf(1.4142135623731 + 1.4142135623731/w) - 0.282094791773878*sqrt(pi)*erf(1.4142135623731 - 1.4142135623731/w) + 0.282094791773878*sqrt(pi)*erf(1.4142135623731 + 1.4142135623731/w))\n",
        "Objective -1.0*n*(0.199471140200716*w*exp(-2.0*(1 - 1/w)**2) - 0.199471140200716*w*exp(-2.0*(1 + 1/w)**2) + 0.282094791773878*sqrt(pi)*w*erf(1.4142135623731 - 1.4142135623731/w) - 0.282094791773878*sqrt(pi)*w*erf(1.4142135623731 + 1.4142135623731/w) - 0.282094791773878*sqrt(pi)*erf(1.4142135623731 - 1.4142135623731/w) + 0.282094791773878*sqrt(pi)*erf(1.4142135623731 + 1.4142135623731/w)) + w**2/2\n",
        "Derivative -1.0*n*(0.199471140200716*exp(-2.0*(1 - 1/w)**2) - 0.199471140200716*exp(-2.0*(1 + 1/w)**2) + 0.282094791773878*sqrt(pi)*erf(1.4142135623731 - 1.4142135623731/w) - 0.282094791773878*sqrt(pi)*erf(1.4142135623731 + 1.4142135623731/w) - 0.797884560802865*(1 - 1/w)*exp(-2.0*(1 - 1/w)**2)/w - 0.797884560802865*(1 + 1/w)*exp(-2.0*(1 + 1/w)**2)/w + 0.797884560802865*exp(-(1.4142135623731 + 1.4142135623731/w)**2)/w + 0.797884560802865*exp(-(1.4142135623731 - 1.4142135623731/w)**2)/w - 0.797884560802865*exp(-(1.4142135623731 + 1.4142135623731/w)**2)/w**2 - 0.797884560802865*exp(-(1.4142135623731 - 1.4142135623731/w)**2)/w**2) + w\n",
        "\tdi 1 w - 0.199471140200716*exp(-2.0*(1 - 1/w)**2) + 0.199471140200716*exp(-2.0*(1 + 1/w)**2) - 0.282094791773878*sqrt(pi)*erf(1.4142135623731 - 1.4142135623731/w) + 0.282094791773878*sqrt(pi)*erf(1.4142135623731 + 1.4142135623731/w) + 0.797884560802865*(1 - 1/w)*exp(-2.0*(1 - 1/w)**2)/w + 0.797884560802865*(1 + 1/w)*exp(-2.0*(1 + 1/w)**2)/w - 0.797884560802865*exp(-(1.4142135623731 + 1.4142135623731/w)**2)/w - 0.797884560802865*exp(-(1.4142135623731 - 1.4142135623731/w)**2)/w + 0.797884560802865*exp(-(1.4142135623731 + 1.4142135623731/w)**2)/w**2 + 0.797884560802865*exp(-(1.4142135623731 - 1.4142135623731/w)**2)/w**2\n"
       ]
      },
      {
       "ename": "ValueError",
       "evalue": "Could not find root within given tolerance. (3.79491e+23 > 2.1684e-19)\nTry another starting point or tweak arguments.",
       "output_type": "pyerr",
       "traceback": [
        "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m\n\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
        "\u001b[0;32m<ipython-input-25-3975d4b37912>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m     18\u001b[0m     \u001b[0mdi\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mDerivative\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msubs\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mn\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mni\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     19\u001b[0m     \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m'\\tdi'\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mni\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0mdi\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 20\u001b[0;31m     \u001b[0msi\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0msympy\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msolvers\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnsolve\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdi\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;36m1\u001b[0m\u001b[0;34m/\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;36m1.0\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0mni\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     21\u001b[0m     \u001b[0mprint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msi\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;32m/usr/local/lib/python3.4/site-packages/sympy/solvers/solvers.py\u001b[0m in \u001b[0;36mnsolve\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m   2478\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2479\u001b[0m         \u001b[0mf\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mlambdify\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfargs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmodules\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 2480\u001b[0;31m         \u001b[0;32mreturn\u001b[0m \u001b[0mfindroot\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mf\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mx0\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   2481\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0mlen\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfargs\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;34m>\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcols\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   2482\u001b[0m         raise NotImplementedError(filldedent('''\n",
        "\u001b[0;32m/usr/local/lib/python3.4/site-packages/sympy/mpmath/calculus/optimization.py\u001b[0m in \u001b[0;36mfindroot\u001b[0;34m(ctx, f, x0, solver, tol, verbose, verify, **kwargs)\u001b[0m\n\u001b[1;32m    973\u001b[0m                              \u001b[0;34m'(%g > %g)\\n'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    974\u001b[0m                              \u001b[0;34m'Try another starting point or tweak arguments.'\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 975\u001b[0;31m                              % (norm(f(*xl))**2, tol))\n\u001b[0m\u001b[1;32m    976\u001b[0m         \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    977\u001b[0m     \u001b[0;32mfinally\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
        "\u001b[0;31mValueError\u001b[0m: Could not find root within given tolerance. (3.79491e+23 > 2.1684e-19)\nTry another starting point or tweak arguments."
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}