changes interactive labeling

master
Anne Lorenz 2019-01-09 14:02:43 +01:00
parent 9367457199
commit 035583584f
7 changed files with 2062 additions and 1290 deletions

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long


@ -1,765 +0,0 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Jupyter Notebook for Interactive Labeling\n",
"______\n",
"\n",
"This Jupyter Notebook combines a manual and automated labeling technique.\n",
    "It includes a basic implementation of a Multinomial Naive Bayes classifier.\n",
"By calculating estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.\n",
"For labeling, 6 classes are used.\n",
"\n",
"\n",
"- **Part I**: Preparation of the data set for labeling.\n",
"\n",
"\n",
    "- **Part II**: Execution of the iterative manual labeling process.\n",
    "\n",
    "\n",
    "- **Part III**: Automated labeling based on estimated class probabilities, with manual verification.\n",
    "\n",
"Please note: User instructions are written in upper-case.\n",
"__________\n",
"Version: 2018-12-01, Anne Lorenz / Datavard AG"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import csv\n",
"import operator\n",
"import pickle\n",
"import random\n",
"\n",
"from ipywidgets import interact, interactive, fixed, interact_manual\n",
"import ipywidgets as widgets\n",
"from IPython.core.interactiveshell import InteractiveShell\n",
"from IPython.display import display\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"from MNBInteractive import MNBInteractive"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Part I"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, we import our data set of 10 000 business news articles from a csv file.\n",
    "It contains 833 or 834 articles from each month of the year 2017.\n",
"For detailed information regarding the data set, please read the full documentation."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of samples in data set in total: 10000\n"
]
}
],
"source": [
    "# current labeling round number (annotates when each sample was labeled)\n",
"m = 0\n",
"\n",
"# initialize random => reproducible sequence\n",
"random.seed(5)\n",
"\n",
"filepath = '../data/cleaned_data_set_without_header.csv'\n",
"\n",
"df = pd.read_csv(filepath,\n",
" header=None,\n",
" sep='|',\n",
" engine='python',\n",
" names = [\"Uuid\", \"Title\", \"Text\", \"Site\", \"SiteSection\", \"Url\", \"Timestamp\"],\n",
" decimal='.',\n",
" quotechar='\\'',\n",
" quoting=csv.QUOTE_NONNUMERIC)\n",
"\n",
"# set up wider display area\n",
"pd.set_option('display.max_colwidth', -1)\n",
"\n",
"# add indices\n",
"df['Index'] = df.index.values\n",
"\n",
"# add round annotation (indicates labeling time)\n",
"df['Round'] = np.nan\n",
"\n",
"# initialize label column with -1 for unlabeled samples\n",
"df['Label'] = np.full((len(df)), -1)\n",
"\n",
"# add column for estimated probability\n",
"df['Probability'] = np.nan\n",
"\n",
"# show full text for print statement\n",
"InteractiveShell.ast_node_interactivity = \"all\"\n",
"\n",
"# row number\n",
"n_rows = df.shape[0]\n",
"print('Number of samples in data set in total: {}'.format(n_rows))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following cell is disabled to prevent later overwriting."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# save as csv\n",
"#df.to_csv('../data/interactive_labeling.csv',\n",
"# sep='|',\n",
"# mode='w',\n",
"# encoding='utf-8',\n",
"# quoting=csv.QUOTE_NONNUMERIC,\n",
"# quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "We load the previously created dictionary mapping article indices (keys) to lists of mentioned organizations (values).\n",
    "In the following, we limit the number of occurrences of a certain company name across all labeled articles to 3 to avoid imbalance."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# global dict of all articles (article index => list of mentioned organizations)\n",
"dict_art_orgs = {}\n",
    "with open('../obj/dict_articles_organizations_without_banks.pkl', 'rb') as infile:\n",
    "    dict_art_orgs = pickle.load(infile)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def show_next(index):\n",
" print('News article no. {}:'.format(index))\n",
" print()\n",
" # show title and text of current news article\n",
" print('HEADLINE:')\n",
" print(df.loc[df['Index'] == index, 'Title'])\n",
" print()\n",
" print('TEXT:')\n",
" print(df.loc[df['Index'] == index, 'Text'])\n",
" \n",
" def f(x):\n",
" ''' this function is executed when slider moved\n",
" '''\n",
" # save user input\n",
" df.loc[df['Index'] == index, 'Label'] = x\n",
" # save number of labeling round\n",
" df.loc[df['Index'] == index, 'Round'] = m\n",
"\n",
" # create slider widget for labels\n",
" interact(f, x = widgets.IntSlider(min=-1, max=5, step=1, value=df.loc[df['Index'] == index, 'Label']))\n",
" print('1: merger of companies A and B, 2: merger pending/in talks/to be approved, 3: merger aborted/denied,') \n",
" print('4: sale or buy of shares/parts/assets or merger of units,')\n",
" print('5: merger as incidental remark (not main topic/not current), 0: other/unrelated news, -1: i don\\'t know')\n",
" print('___________________________________________________________________________________________________')\n",
" print()\n",
" print() \n",
"\n",
"# list of article indices that will be shown next\n",
"label_next = []"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "## Part II\n",
"\n",
"In each iteration...\n",
    " 1. We label the next batch of articles manually (the batch size is set below).\n",
" \n",
" 2. We apply the Multinomial Naive Bayes classification algorithm which returns a vector class_probs $(K_1, K_2, ... , K_6)$ per sample with the probabilities $K_i$ per class $i$.\n",
" \n",
    " 3. We apply class labels automatically where possible. We define a case as distinct if the estimated probability $K_x > 0.8$ for some $x \\in \\{1,...,6\\}$. In that case, our program applies the label.\n",
" \n",
" 4. We check and improve the automated labeling if necessary."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The iteration part begins here. User interaction instructions are written in upper-case.\n",
"\n",
    "PLEASE ENTER THE CURRENT ITERATION NUMBER ('Round no.')."
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Last iteration number: 16\n",
"\n",
"Number of labeled articles: 320 (3.2 percent)\n",
"Number of unlabeled articles: 9680\n"
]
}
],
"source": [
"# read current data set from csv\n",
"df = pd.read_csv('../data/interactive_labeling.csv',\n",
" sep='|',\n",
" usecols=range(1,12), # drop first column 'unnamed'\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')\n",
"\n",
"# find current iteration/round number\n",
"m = int(df['Round'].max())\n",
"\n",
"print('Last iteration number: {}'.format(m))\n",
"print()\n",
"print('Number of labeled articles: {0} ({1:.2} percent)'.format(len(df.loc[df['Label'] != -1]), \n",
" len(df.loc[df['Label'] != -1])/100))\n",
"print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Continue with iteration number: 16\n"
]
}
],
"source": [
"# increment round number\n",
"m += 1\n",
"\n",
"print('Continue with iteration number: {}'.format(m))"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Amazon\n",
"Google\n",
"Alphabet\n",
"EMEA\n",
"BAML\n",
"Royal Dutch Shell\n",
"Glencore\n",
"Abu Dhabi\n",
"Qatar Airways\n",
"Viacom\n",
"Nomura\n",
"General Motors\n",
"Boeing\n",
"Airbus\n",
"Societe Generale\n",
"SEC\n",
"Organization of the Petroleum Exporting Countries\n",
"Toshiba\n",
"LME\n",
"Pepsi\n",
"Microsoft\n",
"BOJ\n",
"Apple\n",
"Amazon.com\n",
"Facebook\n",
"AT & T\n",
"Verizon Communications\n",
"Lloyds\n",
"Unilever\n",
"BP\n",
"Alibaba\n",
"Tesla\n",
"Tencent\n",
"Tesco\n",
"Nestle\n",
"IHS Markit\n",
"VW\n",
"Volkswagen\n",
"Deutsche Telekom\n",
"T-Mobile US\n",
"BHP Billiton\n",
"ING\n",
"CME\n"
]
}
],
"source": [
    "# global dict of mentioned companies in labeled articles (company name => number of occurrences)\n",
"dict_limit = {}\n",
"\n",
"# initialize dict_limit\n",
"df_labeled = df[df['Label'] != -1]\n",
"for index in df_labeled['Index']:\n",
" orgs = dict_art_orgs[index]\n",
" for org in orgs:\n",
" if org in dict_limit:\n",
" dict_limit[org] += 1\n",
" else:\n",
" dict_limit[org] = 1\n",
"\n",
"for k, v in dict_limit.items():\n",
    "    # print organizations that have reached the limit of 3 mentions\n",
" if v == 3:\n",
" print(k)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
    "def pick_random_articles(n, limit = 3):\n",
    "    ''' picks n random unlabeled articles whose mentioned companies all\n",
    "        occur fewer than limit times in the labeled articles so far.\n",
    "        returns a list of the n article indices to label next.\n",
    "    '''\n",
" # labeling list\n",
" list_arts = []\n",
" # article counter\n",
" i = 0\n",
" while i < n:\n",
" # pick random article\n",
    "        rand_i = random.randint(0, len(df) - 1)\n",
" # check if not yet labeled\n",
" if df.loc[rand_i]['Label'] == -1:\n",
" # list of companies in that article\n",
" companies = dict_art_orgs[rand_i]\n",
    "            if all(dict_limit.get(company, 0) < limit for company in companies):\n",
    "                for company in companies:\n",
    "                    dict_limit[company] = dict_limit.get(company, 0) + 1\n",
" # add article to labeling list\n",
" list_arts.append(rand_i)\n",
" i += 1\n",
" return list_arts\n",
"\n",
"# generate new list of article indices for labeling\n",
"batchsize = 1\n",
"label_next = pick_random_articles(batchsize)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"16"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# check round number\n",
"m"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"News article no. 5332:\n",
"\n",
"HEADLINE:\n",
"5332 Creditors seek to overturn Dana Gas sukuk injunction in UK court\n",
"Name: Title, dtype: object\n",
"\n",
"TEXT:\n",
"5332 Market News - Wed Jul 5, 2017 - 7:03am EDT Creditors seek to overturn Dana Gas sukuk injunction in UK court DUBAI, July 5 The owners of Islamic bonds issued by Abu Dhabi-listed Dana Gas have gone to London's High Court of Justice to try to overturn an injunction that prevents them from forcing repayment of the $700 million of sukuk. Analysts say the case could have ramifications across the Islamic finance industry, with any decision against the creditors potentially undermining confidence in Islamic bonds. Dana Gas argues that because of changes in Islamic financial instruments and how they are interpreted, its sukuk are no longer sharia-compliant, and have become unlawful and unenforceable in the United Arab Emirates. The company says it is therefore halting payments on the mudaraba-style sukuk and proposing its creditors exchange them for new Islamic bonds with lower profit distributions. In mid-June, Dana Gas said it had obtained an interim injunction from London's High Court blocking holders of the sukuk, which are due to mature in October, from enforcing claims against the company related to the bonds. Deutsche Bank, representing the sukuk holders, told the High Court on Tuesday the injunction should be set aside, according to legal documents presented to the court and seen by Reuters. Deutsche Bank told the court Dana's case was \"hopeless as a matter of law,\" arguing that asserting the sukuk were illegal was an \"event of default\" allowing the sukuk holders to demand repayment, the documents show. Dana's actions \"have sent shockwaves around the market for Islamic bonds\" because they could erode trust in other sukuk issues, Deutsche Bank said. The judge did not reach a conclusion on Tuesday, and has asked Dana and the other parties to return to the court on Wednesday, a source familiar with the situation told Reuters. (Reporting by Davide Barbuscia; Editing by Andrew Torchia and Mark Potter) \n",
"Name: Text, dtype: object\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "415d739cf3d643429b8bc3ea4884d88a",
"version_major": 2,
"version_minor": 0
},
"text/html": [
"<p>Failed to display Jupyter Widget of type <code>interactive</code>.</p>\n",
"<p>\n",
" If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
" that the widgets JavaScript is still loading. If this message persists, it\n",
" likely means that the widgets JavaScript library is either not installed or\n",
" not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
" Widgets Documentation</a> for setup instructions.\n",
"</p>\n",
"<p>\n",
" If you're reading this message in another frontend (for example, a static\n",
" rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
" it may mean that your frontend doesn't currently support widgets.\n",
"</p>\n"
],
"text/plain": [
"interactive(children=(IntSlider(value=-1, description='x', max=5, min=-1), Output()), _dom_classes=('widget-interact',))"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"1: merger of companies A and B, 2: merger pending/in talks/to be approved, 3: merger aborted/denied,\n",
"4: sale or buy of shares/parts/assets or merger of units,\n",
"5: merger as incidental remark (not main topic/not current), 0: other/unrelated news, -1: i don't know\n",
"___________________________________________________________________________________________________\n",
"\n",
"\n"
]
}
],
"source": [
"for index in label_next:\n",
" show_next(index)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "55a370bf5df94b3aacf4c0396bc43cad",
"version_major": 2,
"version_minor": 0
},
"text/html": [
"<p>Failed to display Jupyter Widget of type <code>Button</code>.</p>\n",
"<p>\n",
" If you're reading this message in the Jupyter Notebook or JupyterLab Notebook, it may mean\n",
" that the widgets JavaScript is still loading. If this message persists, it\n",
" likely means that the widgets JavaScript library is either not installed or\n",
" not enabled. See the <a href=\"https://ipywidgets.readthedocs.io/en/stable/user_install.html\">Jupyter\n",
" Widgets Documentation</a> for setup instructions.\n",
"</p>\n",
"<p>\n",
" If you're reading this message in another frontend (for example, a static\n",
" rendering on GitHub or <a href=\"https://nbviewer.jupyter.org/\">NBViewer</a>),\n",
" it may mean that your frontend doesn't currently support widgets.\n",
"</p>\n"
],
"text/plain": [
"Button(description='Confirm Labels', style=ButtonStyle())"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"5332 0.0\n",
"Name: Label, dtype: float64\n"
]
}
],
"source": [
"# create button widget for confirming labels\n",
"button_confirm = widgets.Button(description='Confirm Labels',\n",
" disabled=False,\n",
" button_style='')\n",
"\n",
"def g(b):\n",
" ''' this function is executed when button_confirm clicked\n",
" ''' \n",
" # show new labels\n",
" print(df.loc[df['Index'].isin(label_next)]['Label'])\n",
"\n",
"# execute function g if button is clicked\n",
"button_confirm.on_click(g)\n",
"\n",
"display(button_confirm)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PLEASE CLICK THE BUTTON ABOVE ('Confirm Labels') TO CONFIRM YOUR LABELS."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {},
"outputs": [],
"source": [
"# save as csv\n",
"df.to_csv('../data/interactive_labeling.csv',\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"This round (no. 16):\n",
"Number of labeled articles: 320 (3.2 percent)\n",
"Number of unlabeled articles: 9680\n"
]
}
],
"source": [
"print('This round (no. {}):'.format(m))\n",
"print('Number of labeled articles: {0} ({1:.2} percent)'.format(len(df.loc[df['Label'] != -1]), \n",
" len(df.loc[df['Label'] != -1])/100))\n",
"print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOW REPEAT PART II OR CONTINUE WITH PART III.\n",
"\n",
"## Part III\n",
"\n",
"Now we build a model and check if it is possible to label some articles automatically."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
    "# split data set into labeled and unlabeled samples\n",
"l_data = df.loc[df['Label'] != -1]\n",
"u_data = df.loc[df['Label'] == -1]\n",
"\n",
"# assign array of classes in order used and array of class probabilities\n",
"%time classes, class_count, class_probs = MNBInteractive.make_nb(l_data, u_data)\n",
"\n",
"print('Label classes in the order in which they are used for class_probs:')\n",
"print(classes)\n",
"\n",
"print('Number of samples of each class:')\n",
"print(class_count)\n",
"\n",
"print('First 10 estimations:')\n",
"print()\n",
"print(class_probs[:10])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of auto-labeled samples in round 13: 0\n"
]
}
],
"source": [
"# list of tuples (articles that were automatically labeled in this round and their estimated label probability)\n",
"tuples_auto_labeled = []\n",
"\n",
    "def insert_estimated_labels(threshold = 0.8):\n",
    "    '''labels an article with class i if the estimated probability\n",
    "       for class i is higher than threshold\n",
    "    '''\n",
    "    for j, vector in enumerate(class_probs):\n",
    "        # row label of the j-th unlabeled sample\n",
    "        row = u_data.index[j]\n",
    "        for i in range(len(classes)):\n",
    "            # check if probability of class i exceeds the threshold\n",
    "            if vector[i] > threshold:\n",
    "                # adopt the estimated label\n",
    "                u_data.loc[row, 'Label'] = classes[i]\n",
    "                # annotate probability\n",
    "                u_data.loc[row, 'Probability'] = vector[i]\n",
    "                # insert current round number\n",
    "                u_data.loc[row, 'Round'] = m\n",
    "                # append tuple (article index, probability) to list of auto-labeled samples\n",
    "                tuples_auto_labeled.append((u_data.loc[row, 'Index'], vector[i]))\n",
"\n",
"# insert estimated labels\n",
"insert_estimated_labels()\n",
"\n",
"print('Number of auto-labeled samples in round {}: {}'.format(m, len(tuples_auto_labeled)))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"if len(tuples_auto_labeled) > 0:\n",
"\n",
    "    # sort newly auto-labeled articles by estimated probability (ascending) and keep their indices\n",
" list_auto_labeled = [t[0] for t in sorted(tuples_auto_labeled, key=lambda x: x[1])]\n",
"\n",
" # concatenate labeled and unlabeled data\n",
" df = pd.concat([l_data, u_data],\n",
" ignore_index=True)\n",
"\n",
" # sort dataframe by index\n",
" df = df.sort_values(['Index'])\n",
"\n",
" # create button widget for checking labels\n",
" button_check = widgets.Button(description='Check Label',\n",
" disabled=False,\n",
" button_style='')\n",
    "    def h(b):\n",
    "        ''' this function is executed when button 'Check Label' is clicked;\n",
    "            it shows the unchecked sample with the lowest estimated\n",
    "            probability next\n",
    "        '''\n",
    "        if len(list_auto_labeled) > 0:\n",
    "            show_next(list_auto_labeled[0])\n",
    "            del list_auto_labeled[0]\n",
    "\n",
    "    # execute function h if button is clicked\n",
    "    button_check.on_click(h)\n",
    "\n",
    "    # a blocking while loop would freeze the kernel before any click\n",
    "    # could be processed, so the button is displayed once and clicked\n",
    "    # repeatedly until every auto-labeled sample has been checked\n",
    "    print('PLEASE CLICK THE BUTTON BELOW (\\'Check Label\\') FOR EACH AUTO-LABELED SAMPLE')\n",
    "    display(button_check)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"End of this round (no. 13):\n",
"Number of labeled articles: 300 (3.0 percent)\n",
"Number of unlabeled articles: 9700\n"
]
}
],
"source": [
"print('End of this round (no. {}):'.format(m))\n",
"print('Number of labeled articles: {0} ({1:.2} percent)'.format(len(df.loc[df['Label'] != -1]), \n",
" len(df.loc[df['Label'] != -1])/100))\n",
"print('Number of unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))\n",
"\n",
"# save to csv\n",
"df.to_csv('../data/interactive_labeling.csv',\n",
" sep='|',\n",
" mode='w',\n",
" encoding='utf-8',\n",
" quoting=csv.QUOTE_NONNUMERIC,\n",
" quotechar='\\'')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOW PLEASE CONTINUE ITERATION. LET PART II RUN AGAIN, CELL BY CELL.\n",
"\n",
"REPEAT UNTIL ALL SAMPLES ARE LABELED."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

File diff suppressed because it is too large


@ -17,11 +17,9 @@ class MNBInteractive:
However, in practice, fractional counts such as tf-idf may also work.
'''
def make_nb(labeled_data, unlabeled_data):
def make_nb(labeled_data, unlabeled_data, sklearn_cv=False):
'''fits naive bayes model
'''
# chose BagOfWords implementation (own if false)
sklearn_cv = False
print('# MNB: starting multinomial naives bayes...')
print()
@ -64,7 +62,7 @@ class MNBInteractive:
else:
# use my own BagOfWords python implementation
stemming = True
rel_freq = True
rel_freq = False
extracted_words = BagOfWords.extract_all_words(X)
vocab = BagOfWords.make_vocab(extracted_words)


@ -200,15 +200,22 @@ class NER:
with open('../obj/dict_articles_organizations.pkl', 'rb') as input:
dict = pickle.load(input)
black_list = ['Eastern and Southern African Trade and Development Bank', 'PTA Bank', 'Citigroup',
black_list = ['Eastern and Southern African Trade and Development Bank', 'PTA Bank', 'Citigroup', 'UniCredit',
'Rand Merchant Bank', 'Banca Carige', 'World Bank', 'Bank of America', 'Deutsche Bank', 'HSBC', 'JP Morgan',
'Credit Suisse', 'JPMorgan', 'BNP Paribas', 'Goldman Sachs', 'Commerzbank', 'Deutsche Boerse', 'Handelsblatt',
'Sky News', 'Labour', 'UN', 'Bank of Japan', 'Goldman', 'Goldman Sachs Asset Management', 'New York Times',
'Bank of Scotland','World Economic Forum','Organisation for Economic Cooperation and Development',
'Labour', 'UN', 'Bank of Japan', 'Goldman', 'Goldman Sachs Asset Management', 'New York Times', 'Royal Bank',
'Bank of Scotland','World Economic Forum','Organisation for Economic Cooperation and Development', 'Blackstone',
'Russell Investments','Royal London Asset Management','Conservative party','Blom Bank','Banco Santander',
'Guardian Money','Financial Services Agency','Munich Re','Banca Popolare di Vicenza','SoftBank',
'Guardian Money','Financial Services Agency','Munich Re','Banca Popolare di Vicenza','SoftBank', 'Sberbank',
'Financial Conduct Authority','Qatar National Bank','Welt am Sonntag','Sueddeutsche Zeitung','Der Spiegel',
'Bank of England', 'Bank of America Merrill Lynch', 'Barclays', 'London Metal Exchange', 'Petroleum Exporting Countries']
'Bank of England', 'Bank of America Merrill Lynch', 'Barclays', 'London Metal Exchange', 'EMEA', 'G20',
'Petroleum Exporting Countries', 'Facebook Twitter Pinterest', 'Moody', 'Allianz', 'Citi', 'Bank', 'CME',
'JPMorgan Chase &', 'Trade Alert', 'Abu Dhabi', 'MILAN', 'Journal', 'MSCI', 'KKR', 'CNBC', 'Feb', 'OECD',
'Gulf Cooperation Council', 'Societe Generale', 'Takata', 'SEC', 'Republican', 'Energy Information Administration',
'Organization of the Petroleum Exporting Countries', 'CBOE', 'LME', 'BOJ', 'BlackRock', 'Banco Popular',
'United Nations', 'CET STOCKS Latest Previo Daily Change', 'Citibank', 'International Energy Agency',
'Confederation of British Industry', 'American Petroleum Institute', 'Deutsche', 'United', 'Pentagon',
'Southern District of New York']
for k, v in dict.items():
for org in black_list: