jupyter notebook ready for labeling

2018-12-19 12:06:43 +01:00
commit 8b36686d0c
parent 59c664fbb0
2 changed files with 10231 additions and 95 deletions
--- a/data/interactive_labeling.csv
+++ b/data/interactive_labeling.csv
--- a/src/2018-12-01-al-interactive-labeling.ipynb
+++ b/src/2018-12-01-al-interactive-labeling.ipynb
@ -26,7 +26,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 1,
+   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
@ -63,7 +63,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 20,
   "metadata": {},
   "outputs": [
    {
@ -115,48 +115,39 @@
    "print('Number of samples in data set in total: {}'.format(n_rows))"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The following cell is commented out so that rerunning the notebook does not overwrite the labeled data in interactive_labeling.csv."
+   ]
+  },
  {
   "cell_type": "code",
-   "execution_count": 3,
+   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save as csv\n",
-    "df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
-    "      sep='|',\n",
-    "      mode='w',\n",
-    "      encoding='utf-8',\n",
-    "      quoting=csv.QUOTE_NONNUMERIC,\n",
-    "      quotechar='\\'')"
+    "#df.to_csv('../data/interactive_labeling.csv',\n",
+    "#      sep='|',\n",
+    "#      mode='w',\n",
+    "#      encoding='utf-8',\n",
+    "#      quoting=csv.QUOTE_NONNUMERIC,\n",
+    "#      quotechar='\\'')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "## PART II\n",
-    "\n",
-    "In each iteration...\n",
-    "  1. We label the next 10 articles manually.\n",
-    "  \n",
-    "  2. We apply the Multinomial Naive Bayes classifier, which returns for each sample a vector class_probs $(K_1, K_2, \\dots, K_6)$ of estimated probabilities $K_i$, one per class $i$.\n",
-    "  \n",
-    "  3. We apply class labels automatically where possible. We treat a sample as unambiguous if the estimated probability $K_x > 0.8$ for some class $x \\in \\{1, \\dots, 6\\}$; in that case, the program applies label $x$.\n",
-    "  \n",
-    "  4. We check and improve the automated labeling if necessary."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "First we load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
+    "We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
    "Below, we cap the number of occurrences of any single company name across all labeled articles at 3 to avoid imbalance."
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
@ -168,7 +159,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": 24,
   "metadata": {},
   "outputs": [],
   "source": [
@ -231,6 +222,22 @@
    "label_next = []"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## PART II\n",
+    "\n",
+    "In each iteration...\n",
+    "  1. We label the next 10 articles manually.\n",
+    "  \n",
+    "  2. We apply the Multinomial Naive Bayes classifier, which returns for each sample a vector class_probs $(K_1, K_2, \\dots, K_6)$ of estimated probabilities $K_i$, one per class $i$.\n",
+    "  \n",
+    "  3. We apply class labels automatically where possible. We treat a sample as unambiguous if the estimated probability $K_x > 0.8$ for some class $x \\in \\{1, \\dots, 6\\}$; in that case, the program applies label $x$.\n",
+    "  \n",
+    "  4. We check and improve the automated labeling if necessary."
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
@ -242,13 +249,13 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "ce204ae8b66d4557972539bda12a5e6f",
+       "model_id": "6c56c792f25843799d540a887e3639f8",
       "version_major": 2,
       "version_minor": 0
      },
@ -286,16 +293,9 @@
    "display(w)"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "NOW PLEASE LET THE FOLLOWING CELLS RUN AGAIN."
-   ]
-  },
  {
   "cell_type": "code",
-   "execution_count": 22,
+   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
@ -315,37 +315,46 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 8,
+   "execution_count": 27,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# read current data set from csv\n",
+    "df = pd.read_csv('../data/interactive_labeling.csv',\n",
+    "          sep='|',\n",
+    "          usecols=range(1,12), # drop first column 'unnamed'\n",
+    "          encoding='utf-8',\n",
+    "          quoting=csv.QUOTE_NONNUMERIC,\n",
+    "          quotechar='\\'')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "Last round:\n",
+      "Last round (no. 0):\n",
      "Number of labeled articles: 0\n",
      "Number of unlabeled articles: 10000\n"
     ]
    }
   ],
   "source": [
-    "# read last data set from csv\n",
-    "df = pd.read_csv('../data/interactive_labeling_round_{}.csv'.format(m-1),\n",
-    "          sep='|',\n",
-    "          usecols=range(1,12), # drop first column 'unnamed'\n",
-    "          encoding='utf-8',\n",
-    "          quoting=csv.QUOTE_NONNUMERIC,\n",
-    "          quotechar='\\'')\n",
-    "\n",
-    "print('Last round:')\n",
+    "print('Last round (no. {}):'.format(m-1))\n",
    "print('Number of labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
    "print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 9,
-   "metadata": {},
+   "execution_count": 29,
+   "metadata": {
+    "scrolled": true
+   },
   "outputs": [
    {
     "name": "stdout",
@ -370,7 +379,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 10,
+   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
@ -388,7 +397,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 11,
+   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
@ -409,7 +418,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "da64d65008f2458f9f2d21bb875ed75a",
+       "model_id": "286c21f192774952bd60d051cf197a8a",
       "version_major": 2,
       "version_minor": 0
      },
@ -458,7 +467,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "cf730482c9b34f909d64d577fe27045b",
+       "model_id": "8b626884311344a8b8b33ae57a46f9ef",
       "version_major": 2,
       "version_minor": 0
      },
@ -507,7 +516,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "03a9887730274dfe880b6e3b3f26f134",
+       "model_id": "fb2f45b1947f4cd0a9fc57e33e188348",
       "version_major": 2,
       "version_minor": 0
      },
@ -556,7 +565,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "7fdb5a8441b444f797b2c6f111791be0",
+       "model_id": "1d5d56ee6f9a4f91bdeee367ddedbbc5",
       "version_major": 2,
       "version_minor": 0
      },
@ -605,7 +614,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "ccc699d5ee1f4c6aa6b23ab27bdf8cfb",
+       "model_id": "3e1cbc9e465945d2a36b3a40b738f023",
       "version_major": 2,
       "version_minor": 0
      },
@ -654,7 +663,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "04674ce34fd549d5b7fbdf0982b476bc",
+       "model_id": "81ba9eef3ac049f58b57a9300e88a20c",
       "version_major": 2,
       "version_minor": 0
      },
@ -703,7 +712,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "dc3abc4c300f4b64aad8acaccc7f4e3a",
+       "model_id": "dc76006231a741838d98f30b56208934",
       "version_major": 2,
       "version_minor": 0
      },
@ -752,7 +761,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "f56b384620fd4ec7bfbade536a0526a5",
+       "model_id": "111da7a4a7244fe1849502554ffc20c7",
       "version_major": 2,
       "version_minor": 0
      },
@ -801,7 +810,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "a9e6c9e12a754d4f87385dfa9a0bf356",
+       "model_id": "62e65ecb60f04c93be7b9d493c90db9c",
       "version_major": 2,
       "version_minor": 0
      },
@ -850,7 +859,7 @@
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "93f2631f19cf4f05b5b6605e2fb8a010",
+       "model_id": "a3c29fd05f7247e0b125aa867601a89b",
       "version_major": 2,
       "version_minor": 0
      },
@ -895,7 +904,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 12,
+   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
@ -909,7 +918,7 @@
    "    ''' \n",
    "    # show new labels\n",
    "    print(df.loc[df['Index'].isin(label_next)]['Label'])\n",
-    "            \n",
+    "\n",
    "# execute function g if button is clicked\n",
    "button_confirm.on_click(g)"
   ]
@ -923,13 +932,13 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 13,
+   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
-       "model_id": "caaeaf0a747e4c88a1ea87e14c6e193a",
+       "model_id": "7de4c71a9825428bae7cbe4b6874a1b6",
       "version_major": 2,
       "version_minor": 0
      },
@ -954,6 +963,23 @@
     },
     "metadata": {},
     "output_type": "display_data"
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "475     2.0\n",
+      "849     0.0\n",
+      "1854    0.0\n",
+      "2569    0.0\n",
+      "4080    0.0\n",
+      "4185    0.0\n",
+      "5874    0.0\n",
+      "6091    5.0\n",
+      "7628    0.0\n",
+      "8684    0.0\n",
+      "Name: Label, dtype: float64\n"
+     ]
    }
   ],
   "source": [
@ -962,32 +988,96 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 14,
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# save as csv\n",
+    "df.to_csv('../data/interactive_labeling.csv',\n",
+    "      sep='|',\n",
+    "      mode='w',\n",
+    "      encoding='utf-8',\n",
+    "      quoting=csv.QUOTE_NONNUMERIC,\n",
+    "      quotechar='\\'')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 31,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    " \n",
+    "# split data set into labeled and unlabeled samples\n",
+    "l_data = df.loc[df['Label'] != -1]\n",
+    "u_data = df.loc[df['Label'] == -1]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
-      "This round:\n",
-      "Number of labeled articles: 0\n",
-      "Number of unlabeled articles: 10000\n"
+      "This round (no. 1):\n",
+      "Number of labeled articles: 10\n",
+      "Number of unlabeled articles: 9990\n"
     ]
    }
   ],
   "source": [
-    "# split data set into labeled and unlabeled samples\n",
-    "l_data = df.loc[df['Label'] != -1]\n",
-    "u_data = df.loc[df['Label'] == -1]\n",
-    "\n",
-    "print('This round:')\n",
+    "print('This round (no. {}):'.format(m))\n",
    "print('Number of labeled articles: {}'.format(len(l_data)))\n",
    "print('Number of unlabeled articles: {}'.format(len(u_data)))"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": 33,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "# MNB: starting multinomial naives bayes...\n",
+      "\n",
+      "# BOW: extracting all words from articles...\n",
+      "\n",
+      "# BOW: making vocabulary of data set...\n",
+      "\n",
+      "# BOW: vocabulary consists of 947 features.\n",
+      "\n",
+      "# MNB: fit training data and calculate matrix...\n",
+      "\n",
+      "# BOW: calculating matrix...\n",
+      "\n",
+      "# BOW: calculating frequencies...\n",
+      "\n",
+      "# MNB: transform testing data to matrix...\n",
+      "\n",
+      "# BOW: extracting all words from articles...\n",
+      "\n",
+      "# BOW: calculating matrix...\n",
+      "\n",
+      "# BOW: calculating frequencies...\n",
+      "\n",
+      "# MNB: ending multinomial naive bayes\n",
+      "Wall time: 21min 5s\n"
+     ]
+    }
+   ],
+   "source": [
+    "# assign array of classes in order used and array of class probabilities\n",
+    "%time classes, class_probs = MNBInteractive.make_nb(l_data, u_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 35,
   "metadata": {},
   "outputs": [
    {
@ -995,26 +1085,48 @@
     "output_type": "stream",
     "text": [
      "Label classes in the order in which they are used for class_probs:\n",
-      "[0, 2, 5]\n",
-      "[[0.3, 0.3, 0.4]]\n"
+      "[0. 2. 5.]\n"
     ]
    }
   ],
   "source": [
-    "# assign array of classes in order used and array of class probabilities\n",
-    "#%time classes, class_probs = MNBInteractive.make_nb(l_data, u_data)\n",
-    "classes = [0, 2, 5]\n",
-    "class_probs = [[0.3, 0.3, 0.4]]\n",
-    "\n",
    "print('Label classes in the order in which they are used for class_probs:')\n",
-    "print(classes)\n",
-    "\n",
+    "print(classes)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 38,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "First 10 estimations:\n",
+      "\n",
+      "[[0.33335808 0.33330861 0.33333331]\n",
+      " [0.33335762 0.33331573 0.33332665]\n",
+      " [0.33336293 0.33332051 0.33331657]\n",
+      " [0.3333764  0.33330196 0.33332164]\n",
+      " [0.33338693 0.33331068 0.33330239]\n",
+      " [0.33336951 0.33332574 0.33330475]\n",
+      " [0.33341934 0.3333032  0.33327745]\n",
+      " [0.33349444 0.33324293 0.33326262]\n",
+      " [0.33337205 0.3333111  0.33331685]\n",
+      " [0.3333632  0.33331446 0.33332234]]\n"
+     ]
+    }
+   ],
+   "source": [
+    "print('First 10 estimations:')\n",
+    "print()\n",
    "print(class_probs[:10])"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 16,
+   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
@ -1044,7 +1156,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
@ -1061,12 +1173,11 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 19,
+   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
    "# sort newly auto-labeled articles by estimated probability (ascending) and collect their indices\n",
-    "# first check sample with lowest estimated probability\n",
    "list_auto_labeled = [t[0] for t in sorted(tuples_auto_labeled, key=lambda x: x[1])]\n",
    "\n",
    "# concatenate labeled and unlabeled data\n",
@ -1079,7 +1190,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 20,
+   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
@ -1097,6 +1208,7 @@
    "button_check.on_click(h)\n",
    "\n",
    "# while there is still an auto-labeled article not yet checked\n",
+    "# check sample with lowest estimated probability next\n",
    "while len(list_auto_labeled) > 0:\n",
    "    print('PLEASE CLICK BUTTON BELOW (\\'Check Labels\\') TO CHECK AUTO-LABELED SAMPLE')\n",
    "    display(button_check)"
@ -1104,12 +1216,33 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 21,
+   "execution_count": 43,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "End of this round (no. 1):\n",
+      "Number of labeled articles: 10\n",
+      "Number of unlabeled articles: 9990\n"
+     ]
+    }
+   ],
+   "source": [
+    "print('End of this round (no. {}):'.format(m))\n",
+    "print('Number of labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
+    "print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
-    "# save this round as csv\n",
-    "df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
+    "# save to csv\n",
+    "df.to_csv('../data/interactive_labeling.csv',\n",
    "      sep='|',\n",
    "      mode='w',\n",
    "      encoding='utf-8',\n",
@ -1121,7 +1254,9 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "NOW PLEASE LET PART II RUN AGAIN. REPEAT UNTIL ALL SAMPLES ARE LABELED."
+    "NOW PLEASE CONTINUE THE ITERATION: RUN PART II AGAIN, CELL BY CELL.\n",
+    "\n",
+    "REPEAT UNTIL ALL SAMPLES ARE LABELED."
   ]
  }
 ],
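
Review note on step 3 of PART II (auto-labeling above a probability of 0.8): the rule can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the code in `MNBInteractive` or in the notebook cells elided from this diff. The function name `auto_label`, the `threshold` parameter, and the second and third probability rows are hypothetical; the first row is taken from the round 1 output above. It assumes `classes` lists the labels in the same column order as `class_probs`, as the printed output suggests.

```python
def auto_label(classes, class_probs, threshold=0.8):
    """Return (row, label, prob) for every row whose best class exceeds threshold.

    Hypothetical helper: `classes` holds the labels in the column order
    of `class_probs`; `threshold` is the cut-off named in PART II.
    """
    confident = []
    for row, probs in enumerate(class_probs):
        # position of the most probable class in this row
        best = max(range(len(probs)), key=probs.__getitem__)
        # strict comparison, matching "K_x > 0.8" in the notebook text
        if probs[best] > threshold:
            confident.append((row, classes[best], probs[best]))
    # lowest probability first, so the least certain label is reviewed first
    return sorted(confident, key=lambda t: t[2])


classes = [0, 2, 5]
class_probs = [
    [0.33335808, 0.33330861, 0.33333331],  # from the round 1 output: no class clears 0.8
    [0.05, 0.90, 0.05],                    # made-up row: class 2 is clear
    [0.85, 0.10, 0.05],                    # made-up row: class 0 is clear
]

print(auto_label(classes, class_probs))
# → [(2, 0, 0.85), (1, 2, 0.9)]
```

With only 10 labeled articles, every estimate printed for round 1 sits near 1/3, so under this rule nothing would be auto-labeled, which is consistent with the unchanged counts (10 labeled, 9990 unlabeled) at the end of the round.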