jupyter notebook ready for labeling

master
Anne Lorenz 2018-12-19 12:06:43 +01:00
parent 59c664fbb0
commit 8b36686d0c
2 changed files with 10231 additions and 95 deletions

File diff suppressed because one or more lines are too long

@ -26,7 +26,7 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
@ -63,7 +63,7 @@
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 20,
"metadata": {},
"outputs": [
{
@ -115,48 +115,39 @@
"print('Number of samples in data set in total: {}'.format(n_rows))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell is disabled to prevent later overwriting."
]
},
{
"cell_type": "code",
"execution_count": 3,
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# save as csv\n",
"df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
"#df.to_csv('../data/interactive_labeling.csv',\n",
"# sep='|',\n",
"# mode='w',\n",
"# encoding='utf-8',\n",
"# quoting=csv.QUOTE_NONNUMERIC,\n",
"# quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PART II\n",
"\n",
"In each iteration...\n",
" 1. We label the next 10 articles manually.\n",
" \n",
" 2. We apply the Multinomial Naive Bayes classification algorithm which returns a vector class_probs $(K_1, K_2, ... , K_6)$ per sample with the probabilities $K_i$ per class $i$.\n",
" \n",
" 3. We apply class labels automatically where possible. We call a case distinct if the estimated probability $K_x > 0.8$ for some $x \\in \\{1,...,6\\}$. In that case, our program applies the label.\n",
" \n",
" 4. We check and improve the automated labeling if necessary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First we load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
"We load the previously created dictionary of all article indices (keys) with a list of mentioned organizations (values).\n",
"In the following, we limit the number of occurrences of a certain company name in all labeled articles to 3 to avoid imbalance."
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
@ -168,7 +159,7 @@
},
{
"cell_type": "code",
"execution_count": 5,
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
@ -231,6 +222,22 @@
"label_next = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PART II\n",
"\n",
"In each iteration...\n",
" 1. We label the next 10 articles manually.\n",
" \n",
" 2. We apply the Multinomial Naive Bayes classification algorithm which returns a vector class_probs $(K_1, K_2, ... , K_6)$ per sample with the probabilities $K_i$ per class $i$.\n",
" \n",
" 3. We apply class labels automatically where possible. We call a case distinct if the estimated probability $K_x > 0.8$ for some $x \\in \\{1,...,6\\}$. In that case, our program applies the label.\n",
" \n",
" 4. We check and improve the automated labeling if necessary."
]
},
{
"cell_type": "markdown",
"metadata": {},
@ -242,13 +249,13 @@
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ce204ae8b66d4557972539bda12a5e6f",
"model_id": "6c56c792f25843799d540a887e3639f8",
"version_major": 2,
"version_minor": 0
},
@ -286,16 +293,9 @@
"display(w)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOW PLEASE LET THE FOLLOWING CELLS RUN AGAIN."
]
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 26,
"metadata": {},
"outputs": [
{
@ -315,37 +315,46 @@
},
{
"cell_type": "code",
"execution_count": 8,
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# read current data set from csv\n",
"df = pd.read_csv('../data/interactive_labeling.csv',\n",
" sep='|',\n",
" usecols=range(1,12), # drop first column 'unnamed'\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last round:\n",
"Last round (no. 0):\n",
"Number of labeled articles: 0\n",
"Number of unlabeled articles: 10000\n"
]
}
],
"source": [
"# read last data set from csv\n",
"df = pd.read_csv('../data/interactive_labeling_round_{}.csv'.format(m-1),\n",
" sep='|',\n",
" usecols=range(1,12), # drop first column 'unnamed'\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')\n",
"\n",
"print('Last round:')\n",
"print('Last round (no. {}):'.format(m-1))\n",
"print('Number of labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
"print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"execution_count": 29,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
@ -370,7 +379,7 @@
},
{
"cell_type": "code",
"execution_count": 10,
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
@ -388,7 +397,7 @@
},
{
"cell_type": "code",
"execution_count": 11,
"execution_count": 12,
"metadata": {},
"outputs": [
{
@ -409,7 +418,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "da64d65008f2458f9f2d21bb875ed75a",
"model_id": "286c21f192774952bd60d051cf197a8a",
"version_major": 2,
"version_minor": 0
},
@ -458,7 +467,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "cf730482c9b34f909d64d577fe27045b",
"model_id": "8b626884311344a8b8b33ae57a46f9ef",
"version_major": 2,
"version_minor": 0
},
@ -507,7 +516,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "03a9887730274dfe880b6e3b3f26f134",
"model_id": "fb2f45b1947f4cd0a9fc57e33e188348",
"version_major": 2,
"version_minor": 0
},
@ -556,7 +565,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7fdb5a8441b444f797b2c6f111791be0",
"model_id": "1d5d56ee6f9a4f91bdeee367ddedbbc5",
"version_major": 2,
"version_minor": 0
},
@ -605,7 +614,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "ccc699d5ee1f4c6aa6b23ab27bdf8cfb",
"model_id": "3e1cbc9e465945d2a36b3a40b738f023",
"version_major": 2,
"version_minor": 0
},
@ -654,7 +663,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "04674ce34fd549d5b7fbdf0982b476bc",
"model_id": "81ba9eef3ac049f58b57a9300e88a20c",
"version_major": 2,
"version_minor": 0
},
@ -703,7 +712,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "dc3abc4c300f4b64aad8acaccc7f4e3a",
"model_id": "dc76006231a741838d98f30b56208934",
"version_major": 2,
"version_minor": 0
},
@ -752,7 +761,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "f56b384620fd4ec7bfbade536a0526a5",
"model_id": "111da7a4a7244fe1849502554ffc20c7",
"version_major": 2,
"version_minor": 0
},
@ -801,7 +810,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a9e6c9e12a754d4f87385dfa9a0bf356",
"model_id": "62e65ecb60f04c93be7b9d493c90db9c",
"version_major": 2,
"version_minor": 0
},
@ -850,7 +859,7 @@
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "93f2631f19cf4f05b5b6605e2fb8a010",
"model_id": "a3c29fd05f7247e0b125aa867601a89b",
"version_major": 2,
"version_minor": 0
},
@ -895,7 +904,7 @@
},
{
"cell_type": "code",
"execution_count": 12,
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
@ -909,7 +918,7 @@
" ''' \n",
" # show new labels\n",
" print(df.loc[df['Index'].isin(label_next)]['Label'])\n",
" \n",
"\n",
"# execute function g if button is clicked\n",
"button_confirm.on_click(g)"
]
@ -923,13 +932,13 @@
},
{
"cell_type": "code",
"execution_count": 13,
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "caaeaf0a747e4c88a1ea87e14c6e193a",
"model_id": "7de4c71a9825428bae7cbe4b6874a1b6",
"version_major": 2,
"version_minor": 0
},
@ -954,6 +963,23 @@
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"475 2.0\n",
"849 0.0\n",
"1854 0.0\n",
"2569 0.0\n",
"4080 0.0\n",
"4185 0.0\n",
"5874 0.0\n",
"6091 5.0\n",
"7628 0.0\n",
"8684 0.0\n",
"Name: Label, dtype: float64\n"
]
}
],
"source": [
@ -962,32 +988,96 @@
},
{
"cell_type": "code",
"execution_count": 14,
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# save as csv\n",
"df.to_csv('../data/interactive_labeling.csv',\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
" \n",
"# split data set into labeled and unlabeled samples\n",
"l_data = df.loc[df['Label'] != -1]\n",
"u_data = df.loc[df['Label'] == -1]"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This round:\n",
"Number of labeled articles: 0\n",
"Number of unlabeled articles: 10000\n"
"This round (no. 1):\n",
"Number of labeled articles: 10\n",
"Number of unlabeled articles: 9990\n"
]
}
],
"source": [
"# split data set into labeled and unlabeled samples\n",
"l_data = df.loc[df['Label'] != -1]\n",
"u_data = df.loc[df['Label'] == -1]\n",
"\n",
"print('This round:')\n",
"print('This round (no. {}):'.format(m))\n",
"print('Number of labeled articles: {}'.format(len(l_data)))\n",
"print('Number of unlabeled articles: {}'.format(len(u_data)))"
]
},
{
"cell_type": "code",
"execution_count": 15,
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"# MNB: starting multinomial naives bayes...\n",
"\n",
"# BOW: extracting all words from articles...\n",
"\n",
"# BOW: making vocabulary of data set...\n",
"\n",
"# BOW: vocabulary consists of 947 features.\n",
"\n",
"# MNB: fit training data and calculate matrix...\n",
"\n",
"# BOW: calculating matrix...\n",
"\n",
"# BOW: calculating frequencies...\n",
"\n",
"# MNB: transform testing data to matrix...\n",
"\n",
"# BOW: extracting all words from articles...\n",
"\n",
"# BOW: calculating matrix...\n",
"\n",
"# BOW: calculating frequencies...\n",
"\n",
"# MNB: ending multinomial naive bayes\n",
"Wall time: 21min 5s\n"
]
}
],
"source": [
"# assign array of classes in order used and array of class probabilities\n",
"%time classes, class_probs = MNBInteractive.make_nb(l_data, u_data)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
@ -995,26 +1085,48 @@
"output_type": "stream",
"text": [
"Label classes in the order in which they are used for class_probs:\n",
"[0, 2, 5]\n",
"[[0.3, 0.3, 0.4]]\n"
"[0. 2. 5.]\n"
]
}
],
"source": [
"# assign array of classes in order used and array of class probabilities\n",
"#%time classes, class_probs = MNBInteractive.make_nb(l_data, u_data)\n",
"classes = [0, 2, 5]\n",
"class_probs = [[0.3, 0.3, 0.4]]\n",
"\n",
"print('Label classes in the order in which they are used for class_probs:')\n",
"print(classes)\n",
"\n",
"print(classes)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First 10 estimations:\n",
"\n",
"[[0.33335808 0.33330861 0.33333331]\n",
" [0.33335762 0.33331573 0.33332665]\n",
" [0.33336293 0.33332051 0.33331657]\n",
" [0.3333764 0.33330196 0.33332164]\n",
" [0.33338693 0.33331068 0.33330239]\n",
" [0.33336951 0.33332574 0.33330475]\n",
" [0.33341934 0.3333032 0.33327745]\n",
" [0.33349444 0.33324293 0.33326262]\n",
" [0.33337205 0.3333111 0.33331685]\n",
" [0.3333632 0.33331446 0.33332234]]\n"
]
}
],
"source": [
"print('First 10 estimations:')\n",
"print()\n",
"print(class_probs[:10])"
]
},
{
"cell_type": "code",
"execution_count": 16,
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
@ -1044,7 +1156,7 @@
},
{
"cell_type": "code",
"execution_count": 17,
"execution_count": 40,
"metadata": {},
"outputs": [
{
@ -1061,12 +1173,11 @@
},
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# sort newly labeled articles by their estimated probability and return list of indices\n",
"# first check sample with lowest estimated probability\n",
"list_auto_labeled = [t[0] for t in sorted(tuples_auto_labeled, key=lambda x: x[1])]\n",
"\n",
"# concatenate labeled and unlabeled data\n",
@ -1079,7 +1190,7 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
@ -1097,6 +1208,7 @@
"button_check.on_click(h)\n",
"\n",
"# while there is still an auto-labeled article not yet checked\n",
"# check sample with lowest estimated probability next\n",
"while len(list_auto_labeled) > 0:\n",
" print('PLEASE CLICK BUTTON BELOW (\\'Check Labels\\') TO CHECK AUTO-LABELED SAMPLE')\n",
" display(button_check)"
@ -1104,12 +1216,33 @@
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"End of this round (no. 1):\n",
"Number of labeled articles: 10\n",
"Number of unlabeled articles: 9990\n"
]
}
],
"source": [
"print('End of this round (no. {}):'.format(m))\n",
"print('Number of labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))\n",
"print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"# save this round as csv\n",
"df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),\n",
"# save to csv\n",
"df.to_csv('../data/interactive_labeling.csv',\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
@ -1121,7 +1254,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"NOW PLEASE LET PART II RUN AGAIN. REPEAT UNTIL ALL SAMPLES ARE LABELED."
"NOW PLEASE CONTINUE ITERATION. LET PART II RUN AGAIN, CELL BY CELL.\n",
"\n",
"REPEAT UNTIL ALL SAMPLES ARE LABELED."
]
}
],
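
Note on step 3 of PART II: the cells that apply the 0.8 threshold fall in hunks that are not shown in this diff, so the following is only a minimal sketch of how such a rule could look, not the notebook's actual code. The function name `auto_label` and the sample rows in `demo` are hypothetical; `classes` and `class_probs` are assumed to have the shapes printed in the outputs above (one row of probabilities per unlabeled sample, columns in the order given by `classes`).

```python
def auto_label(classes, class_probs, threshold=0.8):
    """Return (row, label, probability) for each sample whose most likely
    class has an estimated probability strictly above the threshold."""
    confident = []
    for row, probs in enumerate(class_probs):
        # position of the most likely class in this row
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] > threshold:
            confident.append((row, classes[best], probs[best]))
    return confident


classes = [0, 2, 5]

# near-uniform estimates, like the first row printed in the notebook output
uniform = [[0.33335808, 0.33330861, 0.33333331]]
print(auto_label(classes, uniform))  # []

# hypothetical estimates in which only the first sample is distinct
demo = [[0.10, 0.85, 0.05], [0.40, 0.35, 0.25]]
print(auto_label(classes, demo))     # [(0, 2, 0.85)]
```

With estimates as close to uniform as the ones printed in this commit (about 0.333 per class), no sample passes the threshold, which is consistent with the labeled count staying at 10 at the end of round 1.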