# Jupyter Notebook for Interactive Labeling
______

This Jupyter Notebook combines manual and automated labeling techniques.
It includes a basic implementation of a multinomial Naive Bayes classifier.
Based on the estimated class probabilities, we decide whether a news article has to be labeled manually or can be labeled automatically.
The labeling is multiclass and uses 3 classes.
  
Please note: User instructions are written in upper-case.
__________
Version: 2019-02-28, Anne Lorenz

In [1]:
import csv
import operator
import pickle
import random

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets
from IPython.core.interactiveshell import InteractiveShell
from IPython.display import display
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile
from sklearn.metrics import recall_score, precision_score, f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
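# the label_propagation submodule is private in newer scikit-learn releases; import the class instead
from sklearn.semi_supervised import LabelPropagation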

from BagOfWords import BagOfWords
from MNBInteractive import MNBInteractive
from MultinomialNaiveBayes import MultinomialNaiveBayes

## Part I: Data preparation

First, we import our data set of 10 000 business news articles from a CSV file.
It contains 833 or 834 articles for each month of 2017.
For detailed information regarding the data set, please read the full documentation.

In [2]:
# initialize random => reproducible sequence
random.seed(5)

filepath = '../data/cleaned_data_set_without_header.csv'

# set up wider display area (None = no truncation; -1 is deprecated since pandas 1.0)
pd.set_option('display.max_colwidth', None)

# show the output of every expression in a cell, not only the last one
InteractiveShell.ast_node_interactivity = "all"

We load the previously created dictionary that maps every article index (key) to a list of the organizations mentioned in that article (value).
In the following, we limit the number of occurrences of any one company name across all labeled articles to 3 to avoid imbalance.
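The capping rule described above could look roughly like the following sketch. It is not the notebook's own code: the function name `select_articles`, the `Counter` bookkeeping and the toy dictionary are assumptions made for illustration; only the cap of 3 and the index-to-organizations dictionary come from the text.

```python
from collections import Counter

MAX_PER_ORG = 3  # cap taken from the text above


def select_articles(candidates, orgs_by_article, counts=None, cap=MAX_PER_ORG):
    """Return the candidate indices whose organizations are all still below the cap.

    candidates:      article indices in the order they should be tried
    orgs_by_article: dict mapping article index -> list of organization names
    counts:          Counter of organizations already present in labeled articles
    """
    counts = Counter() if counts is None else counts
    selected = []
    for idx in candidates:
        orgs = set(orgs_by_article.get(idx, []))
        if any(counts[org] >= cap for org in orgs):
            continue  # one of the organizations already occurs `cap` times
        counts.update(orgs)
        selected.append(idx)
    return selected


# toy data, invented for illustration only
orgs_by_article = {
    0: ['Acme'], 1: ['Acme', 'Globex'], 2: ['Acme'],
    3: ['Acme'], 4: ['Globex'], 5: [],
}
print(select_articles(range(6), orgs_by_article))  # [0, 1, 2, 4, 5]
```

Article 3 is skipped because 'Acme' has already been counted three times; articles without any recognized organization always pass.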

PLEASE INSERT M MANUALLY IF PROCESS HAS BEEN INTERRUPTED BEFORE.

In [8]:
m=16

In [9]:
# read current data set from csv
df = pd.read_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),
          sep='|',
          usecols=range(1,13), # drop first column 'unnamed'
          encoding='utf-8',
          quoting=csv.QUOTE_NONNUMERIC,
          quotechar='\'')

# find current iteration/round number
m = int(df['Round'].max())
print('This round number: {}'.format(m))
print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))
print('Number of manually unlabeled articles: {}'.format(len(df.loc[df['Label'] == -1])))

This round number: 16
Number of manually labeled articles: 1132
Number of manually unlabeled articles: 8868
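The file is written and read with `sep='|'`, `quotechar='\''` and `csv.QUOTE_NONNUMERIC`. As a small illustration of that convention, the sketch below round-trips one invented row with the standard-library `csv` module. pandas uses its own parser, so this only demonstrates the quoting rule, which is also why columns such as `Index`, `Round` and `Label` show up as floats (`9819.0`, `0.0`, `-1.0`) in the rows printed further down.

```python
import csv
import io

# header plus one made-up row; the title contains the quote character on purpose
rows = [
    ['Uuid', 'Title', 'Round', 'Label'],
    ['abc123', "Russia's retailer prices IPO", 0, -1],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter='|', quotechar="'",
                    quoting=csv.QUOTE_NONNUMERIC)
writer.writerows(rows)  # every non-numeric field is quoted, quotes are doubled

buffer.seek(0)
reader = csv.reader(buffer, delimiter='|', quotechar="'",
                    quoting=csv.QUOTE_NONNUMERIC)
header, record = list(reader)  # unquoted fields are converted to float

print(record)  # ['abc123', "Russia's retailer prices IPO", 0.0, -1.0]
```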


START:

Building the training data set using stratified sampling:

In [5]:
m = 0

In [6]:
# iteration number
m

0

In [7]:
#set_0 = df.loc[(df['Round'] < m) & (df['Label'] == 0)]
#set_1 = df.loc[(df['Round'] < m) & (df['Label'] == 1)]
#set_2 = df.loc[(df['Round'] < m) & (df['Label'] == 2)]

In [8]:
strat_len = min(len(set_0), len(set_1), len(set_2))
len(set_0)
len(set_1)
len(set_2)
strat_len

84

3

13

3
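A minimal sketch of the stratified sampling step, assuming a frame with a `Label` column: every class contributes as many rows as the smallest class has (`strat_len`). The helper name `stratified_sample` and the random draw are assumptions; the commented-out notebook code further down slices the first rows of each set instead of sampling them.

```python
import pandas as pd


def stratified_sample(labeled, label_col='Label', seed=5):
    """Draw the same number of rows from every class (size of the smallest class)."""
    strat_len = int(labeled[label_col].value_counts().min())
    parts = [group.sample(n=strat_len, random_state=seed)
             for _, group in labeled.groupby(label_col)]
    return pd.concat(parts)


# toy frame with the class sizes shown in the output above (84, 3, 13)
toy = pd.DataFrame({'Label': [0] * 84 + [1] * 3 + [2] * 13})
sample = stratified_sample(toy)
print(len(sample))  # 9 rows: 3 per class
```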

In [9]:
# new for CV:
#training_data = df.loc[(df['Round'] < m)]

Unnamed: 0,Uuid,Title,Text,Site,SiteSection,Url,Timestamp,Index,Round,Label,Probability,Estimated
9819,05581b3fad7dde79fb4a44ab11400222423903a7,"CANADA STOCKS-TSX posts record high as energy, marijuana company shares climb","December 27, 2017 / 9:52 PM / in 8 minutes CANADA STOCKS-TSX posts record high as energy, marijuana company shares climb Reuters Staff (Adds details throughout and updates prices to close) * TSX ends up 37.86 points, or 0.23 percent, at 16,203.13 * Index posts record closing high * Energy climbs 1.6 percent * Canopy Growth Co surges 20.1 percent TORONTO, Dec 27 (Reuters) - Canadas main stock index rose on Wednesday to a record high as a recent rally in commodity prices boosted the energy and materials sectors, while healthcare gained more than 6 percent as shares of marijuana companies jumped. * The Toronto Stock Exchanges S&P/TSX composite index ended up 37.86 points, or 0.23 percent, at 16,203.13, a record closing high. * Energy shares climbed 1.6 percent, with Suncor Energy Inc up 2.5 percent at C$45.82. * The price of U.S. crude oil settled 0.6 percent lower at $59.64 a barrel. But it had touched a 2-1/2-year high in the previous session when the TSX was closed for the Boxing Day holiday. * The materials group, which includes precious and base metals miners and fertilizer companies, added 0.9 percent. * Teck Resources Ltd, which exports steelmaking coal and mines metals, including copper, gained 3.1 percent to C$33.33. * Copper prices advanced 1.3 percent to $7,219 a tonne. * Five of the TSXs 10 main groups ended higher. * Shares of marijuana companies rose after Canadian regulators rejected Aurora Cannabis Incs request to shorten the minimum deposit period to 35 days from 105 days for the hostile takeover of CanniMed Therapeutics Inc. * Aurora Cannabis gained 11.1 percent and CanniMed Therapeutics rose nearly 4 percent, while Canopy Growth Co was the largest percentage gainer on the TSX. It surged 20.1 percent to C$27.77. 
* The largest decliner on the index was Centerra Gold , which plunged 10.3 percent to C$6.50 after the company said mill processing operations at the Mount Milligan Mine in British Columbia have been temporarily suspended due to lack of sufficient water resources. * The heavyweight financials group fell 0.3 percent and technology shares declined 0.6 percent. * Advancing issues outnumbered declining ones on the TSX by 143 to 96, for a 1.49-to-1 ratio on the upside. * The index was posting nine new 52-week highs and one new low. (Reporting by Fergal Smith; Editing by Bill Trott and Meredith Mazzilli)",reuters.com,http://feeds.reuters.com/reuters/companyNews,https://www.reuters.com/article/canada-stocks/canada-stocks-tsx-posts-record-high-as-energy-marijuana-company-shares-climb-idUSL1N1OR169,2017-12-27T23:48:00.000+02:00,9819.0,0.0,0.0,,-1.0
850,d6970501f4180455abb50f120d4f0c1da7818c2a,UPDATE 1-Russia's Detsky Mir prices IPO at bottom of range-sources,"(Adds details about demand, background)MOSCOW Feb 8 Russia's largest children's goods retailer Detsky Mir has priced its initial public offering at 85 roubles ($1.43) per share, at the bottom of the 85-87 rouble range, two sources familiar with the deal said on Wednesday.The company saw bids for more than 1.5 times the number of shares on offer, drawing strong demand from foreign investors, said another source, who is close to the placement.More than 30 percent of demand came from U.S. investors, around 35 percent from Europe, less than 10 percent from Russia and more than 25 percent from the Middle East and Asia, he said.The source added there were ""hedge funds, long-only investors including sovereign wealth funds"" among the buyers and that nobody would take a dominant position.The IPO is a test of how quickly investor appetite for Russian assets is recovering, after a three-year period when the economy was buffeted by a slump in oil prices, economic slowdown, and Western sanctions imposed over the conflict in Ukraine.The transaction comprised shares sold by the Sistema conglomerate and the Russia-China Investment Fund.Detsky Mir declined to comment. ($1 = 59.4112 roubles) (Reporting by Maria Kiselyova and Olga Sichkar; Editing by Christian Lowe)",reuters.com,http://in.reuters.com/finance/deals,http://in.reuters.com/article/russia-detsky-mir-ipo-price-idINL5N1FT0DF,2017-02-08T02:45:00.000+02:00,850.0,0.0,0.0,,-1.0
3242,41dfed3520e939bff77261f50947154ace5b0a77,"Takata rescue talks extended, even as bankruptcy risk looms","By Taiga Uranaka and Maki Shiraki - TOKYO, April 14 TOKYO, April 14 Potential rescuers of Japan's Takata Corp have extended talks, already in their 14th month, for a deal to take over the air bag maker at the heart of the auto industry's biggest safety recall, people briefed on the process said.Car-parts maker Key Safety Systems Inc (KSS) and Bain Capital LLC are the preferred bidder for Takata, whose faulty air bags have been blamed for at least 16 deaths worldwide.Discussions that include the steering committee tapped by the air bag maker to oversee the search for a financial sponsor, automaker clients, suitors and bankers are now likely to run on until at least end-May, three people told Reuters.The parties have already moved beyond an informal, self-imposed end-March deadline to thrash out a deal.Recent talks, described by two participants as chaotic, have focused on issues such as an indemnity agreement to cover reimbursement costs for air bag recalls, estimated to be as high as $10 billion.KSS, a U.S.-based maker of air bags, seatbelts and steering wheels, and Bain, a U.S. private equity fund, are still conducting due diligence, one of those close to the matter said.Another said KSS - which was bought last year by China's Ningbo Joyson Electronic Corp - and Bain plan to offer around 200 billion yen ($1.8 billion) for Takata.A spokesman for Takata and the steering committee declined to comment. A spokeswoman for KSS also declined to comment.Automakers including Honda Motor Co, which have been footing the bill for recalls dating back to 2008, want Takata restructured through a transparent court-ordered process such as bankruptcy, which would wipe out the firm's shareholder value, four automaker sources have told Reuters.""There's no other option,"" said one automaker executive. ""A privately arranged restructuring would require them to repay billions. 
They can't afford that.""But Takata, the world's second-biggest air bag maker, is holding out for a private restructuring that would preserve some of the founding Takada familys 60 percent stake.BATTERED REPUTATIONThe clock is ticking for Takata, whose stock has cratered 90 percent since the recall crisis began escalating in early 2014.U.S. federal Judge George Steeh in February cited the potential for Takata to collapse if it couldnt find a buyer.Takata pleaded guilty in Steehs District Court to a felony charge as part of a $1 billion settlement with automakers and victims of its inflators, which can explode with excessive force, blasting shrapnel into passenger areas.The company, which began as a textiles firm and became an early maker of seatbelts, is also trying to settle legal liabilities in the United States, where it faces a class-action lawsuit, and other countries where its air bag inflators have exploded.Takata has denied speculation it would have to seek some form of bankruptcy protection from creditors in the United States or Japan.The company has not been allowed to simply disappear as the auto industry needs it to keep producing the millions of inflators needed to replace recalled air bags - though some automakers have switched to rival suppliers.Also, the government in Tokyo is keen to preserve a major Japanese maker of air bags in a global industry dominated by just three companies.($1 = 108.8300 yen) (Additional reporting by, Taro Fuse, Naomi Tajitsu and Junko Fujita; Editing by William Mallard and Ian Geoghegan)",reuters.com,http://in.reuters.com/finance/deals,http://in.reuters.com/article/takata-restructuring-idINL8N1HA098,2017-04-14T13:56:00.000+03:00,3242.0,0.0,0.0,,-1.0


Unnamed: 0,Uuid,Title,Text,Site,SiteSection,Url,Timestamp,Index,Round,Label,Probability,Estimated
2730,9f6bd526a9c9c9305b18b4d121d723ac878012ef,LVMH's Arnault to take full control of Christian Dior,"By Dominique Vidalon and Gilles Guillaume - PARIS PARIS French billionaire Bernard Arnault will combine the Christian Dior fashion brand with his LVMH luxury goods empire as part of a 12 billion euro ($13 billion) move to simplify his business interests - a restructuring long demanded by other investors.Under a series of complex transactions, LVMH ( LVMH.PA ), the world's largest luxury group, will buy the Christian Dior Couture brand from the Christian Dior ( DIOR.PA ) holding company for 6.5 billion euros, including debt.The deal will unite the 70 year old fashion label worn by film stars from Grace Kelly and Elizabeth Taylor to Jennifer Lawrence and Natalie Portman with the Christian Dior perfume and beauty business already owned by LVMH.The Arnault family, which holds a 47 percent stake in LVMH, will also offer to buy the 25.9 percent of the Christian Dior holding company it does not already own for about 260 euros per share, a premium of 15 percent over Monday's closing price.The transactions ""will allow the simplification of the structures, long requested by the market, and the strengthening of LVMH's Fashion and Leather Goods division,"" the 68-year-old Arnault said in a statement.LVMH shares rose almost 5 percent to a record high of 225 euros as investors welcomed the deals, which they expect to boost LVMH earnings. 
Dior shares also jumped 13 percent to a new high of 256 euros.""This is a good acquisition for LVMH in our view given the strong brand of Christian Dior, good use of its balance sheet and it reunites the Christian Dior brand with the very profitable perfume operation that LVMH operates,"" Barclays analysts wrote in a research note.LAST BIG DEAL?LVMH said it would use a loan to pay for Christian Dior Couture, which has 198 stores in over 60 countries, and whose sales have doubled over the past five years.Exane BNP Paribas analyst Luca Solca welcomed ""the long awaited LVMH and Dior merger"", which he said was made at a reasonable valuation. Including debt, LVMH is paying 15.6 times Dior's 2017 earnings before interest, taxes, depreciation and amortization (EBITDA).Solca added the deal also reduced the risk of LVMH, whose brands include Louis Vuitton and Hennessy cognac, buying pricey, ""trophy assets"".Finance chief Jean-Jacques Guiony declined to comment on LVMH's future mergers and acquisitions (M&A) policy. But Arnault told the Financial Times that LVMH was not hunting for acquisitions as ""fewer and fewer assets are looking attractive to us. And the best assets are not for sale.""The Dior holding company owns 41 percent of the LVMH group and 100 percent of Christian Dior Couture, the home of the Lady Dior handbag.Arnault's family company will offer 172 euros per share and 0.192 Hermes ( HRMS.PA ) shares for each Dior holding company share. 
There are potential all-cash and all-share alternatives.Arnault has a stake of about 8 percent in luxury group Hermes ( HRMS.PA ), and Hermes' shares fell from earlier record highs on the prospect of more of the stock coming to the market.LVMH said the overall deal would boost earnings per share by some 3 percent within the first year of its completion, with the transactions expected to close during the second half of 2017.(Additional reporting by Blandine Henault; Editing by Andrew Callus and Mark Potter)",reuters.com,http://feeds.reuters.com/reuters/INbusinessNews,http://in.reuters.com/article/us-lvmh-dior-idINKBN17R0I1,2017-04-25T17:50:00.000+03:00,2730.0,0.0,1.0,,-1.0
4043,04e51a867745700ec080b4e17618d8c37bfe4122,EU mergers and takeovers (May 19),"BRUSSELS May 19 The following are mergers under review by the European Commission and a brief guide to the EU merger process:APPROVALS AND WITHDRAWALS-- U.S. packaging company WestRock to acquire U.S. peer Multi Packaging Solutions (approved May 18)-- Italian cinema operator The Space Cinema, which is controlled by Vue International Holdco Ltd, and Italian peer UCI Italian S.p.A. which is part of Chinese conglomerate Dalian Wanda Group, to set up a joint venture (approved May 18)-- Investment companies TPG and Oaktree to take joint control over Britain's Iona Energy Co, which owns 75 percent of two undeveloped oil fields in the North Sea and that will be active in crude oil production and sale (approved May 18)-- French aircraft engine and aerospace equipment company Safran and China Eastern Airlines Co. Ltd. to form joint venture to provide aircraft maintenance in China (approved May 18)-- Energy company Electricite de France, French state-owned bank Caisse des depots et consignations and Japan's Mitsubishi Corporation to create a joint venture NGM to finance electric mobility projects mainly in France (approved May 18)NEW LISTINGS-- Chinese conglomerate HNA Holding Group Co to acquire Singapore-listed logistics company CWT (notified May 18/deadline June 27/simplified)-- Buyout firm Blackstone and Canada Pension Plan Investment Board (CPPIB) to acquire indirect joint control of U.S. 
educational content provider Ascend Learning (notified May 18/deadline June 27/simplified)EXTENSIONS AND OTHER CHANGESNoneFIRST-STAGE REVIEWS BY DEADLINEMAY 22-- Investment firms Cinven Capital Management and Canada Pension Plan Investment Board to acquire joint control of Travel Holdings Parent Corporation (notified April 10/deadline May 22)MAY 29-- French EDF to acquire equipment and fuel manufacturing company Areva (notified April 18/deadline May 29)MAY 30-- French media group Vivendi to acquire de facto sole control of Italy's Telecom Italia (notified March 31/deadline extended to May 30 from May 12 after Vivendi offered concessions)MAY 31-- Manufacturing and technology company General Electric's Oil & Gas to acquire oilfield services company Baker Hughes (notified April 20/deadline May 31)JUNE 1-- Waste water company SGAB and Spanish infrastructure company Acciona to acquire 10 percent of Sociedad Concesionaria de la Zona Regable del Canal de Navarra (notified April 21/deadline June 1/simplified)JUNE 2-- Australian bank Macquarie and British pension fund Universities Superannuation Scheme to acquire Green Investment Bank (notified April 24/deadline June 2/simplified)JUNE 7-- German company CWS-Boco, which is part of German firm Haniel, to acquire some of British support services firm Rentokil's workwear and hygiene units (notified April 26/deadline June 7)JUNE 8-- German chemicals company Evonik Industries to acquire U.S. company J.M. Huber Corp's silica business (notified April 27/deadline June 8)JUNE 9-- Private equity firm Hellman & Friedman to acquire Spanish logistics platform Allfunds Bank (notified April 28/deadline June 9/simplified)-- U.S. 
smartphone chipmaker Qualcomm to acquire Dutch companyr NXP Semiconductors NV (notified April 28/deadline June 9)-- Chinese textiles company Shanghai Shenda to acquire International Automotive Components Group's trim and acoustics unit business (notified April 24/deadline June 9/simplified)JUNE 12-- American healthcare company Johnson & Johnson to acquire Swiss biotech company Actelion (notified April 12/deadline extended to June 12 from May 24 after the companies offered concessions)-- Norwegian debt collection agency Nordic Capital, which is majority owned by Nordic Capital Fund VIII and Swedish peer firm Intrum Justitia to merge (notified April 12/deadline extended to June 12 from May 24 after the companies offered concessions)JUNE 14-- Private equity firms BC Partners and Pollen Street Capital Ltd to jointly acquire UK bank Shawbrook Group plc (notified May 4/deadline June 14/simplified)JUNE 15-- U.S. private equity firm Leonard Green & Partners and the Ontario Municipal Employees Retirement System Primary Pension Plan (OMERS) to acquire joint control of U.S. 
car repairs company OPE Caliber Holdings (notified May 5/deadline June 15/simplified)-- Austrian refractories materials maker RHI to acquire a controlling stake in Brazilian peer Magnesita Refratarios (notified May 5/deadline June 15)JUNE 21-- Investment bank Goldman Sachs and French investment company Eurazeo to jointly acquire Dominion Web Solutions (notified May 12/deadline June 21/simplified)-- French private equity company Ardian France and real estate agent Jones Lang LaSalle Inc to jointly acquire an office building in France (notified May 12/deadline June 21/simplified)-- French minerals company Imerys to acquire French calcium aluminate cements maker Kerneos (notified May 12/deadline June 21)JUNE 22-- German online fashion retailer Zalando and fashion company Bestseller United to set up a joint venture (notified May 15/deadline June 22/simplified)JUNE 26-- Private equity firms Advent International and Bain Capital Investors to jointly acquire payment services company RatePAY (notified May 17/deadline June 26/simplified)-- Private equity firm Oaktree to acquire German nursing care provider Vitanas P&W (notified May 17/deadline June 26/simplified)GUIDE TO EU MERGER PROCESSDEADLINES:The European Commission has 25 working days after a deal is filed for a first-stage review. It may extend that by 10 working days to 35 working days, to consider either a company's proposed remedies or an EU member state's request to handle the case.Most mergers win approval but occasionally the Commission opens a detailed second-stage investigation for up to 90 additional working days, which it may extend to 105 working days.SIMPLIFIED:Under the simplified procedure, the Commission announces the clearance of uncontroversial first-stage mergers without giving any reason for its decision. Cases may be reclassified as non-simplified - that is, ordinary first-stage reviews - until they are approved. 
(Reporting by Foo Yun Chee)",reuters.com,http://in.reuters.com/finance/deals,http://in.reuters.com/article/eu-ma-idINL8N1IL4K0,2017-05-19T15:45:00.000+03:00,4043.0,0.0,1.0,,-1.0
5950,4dde03a74e2ac98954002bfc960a5f1d60182649,Canada's OneREIT to be taken private in a C$1.1 billion deal,"August 4, 2017 / 12:57 PM / 3 hours ago Canada's OneREIT to be taken private in a C$1.1 billion deal 1 Min Read (Reuters) - Canada's OneREIT ( ONR_u.TO ) said on Friday it would go private after being bought by SmartREIT and Strathallen Acquisitions Inc in a C$1.1 billion deal, including debt. Under the terms of the deal, shareholders of OneREIT, which owns and operates shopping centers in Canada, will receive C$4.26 per share in cash and SmartREIT unit. The company said it was exploring strategic alternatives earlier this year. Reporting by Ahmed Farhatha in Bengaluru; Editing by Arun Koyyur 0 : 0",reuters.com,http://in.reuters.com/finance/deals,https://in.reuters.com/article/us-onereit-m-a-smartreit-idINKBN1AK1II,2017-08-04T10:57:00.000+03:00,5950.0,0.0,1.0,,-1.0


Unnamed: 0,Uuid,Title,Text,Site,SiteSection,Url,Timestamp,Index,Round,Label,Probability,Estimated
4352,752df0da1a3ecf70f57c017b798b2ceff37f557a,FTC to advise blocking Walgreens deal to buy Rite Aid - CNBC,"June 9 Regulatory authorities are set to advise blocking U.S. drugstore chain Walgreens Boots Alliance Inc's deal to buy smaller rival Rite Aid Corp, CNBC reported on Friday, citing a report.The companies have been waiting for a year-and-a-half for approval from the Federal Trade Commission (FTC) since the initial offer made in 2015.In that time, the closing date of the deal has been postponed repeatedly and the offer price reduced to $6.50 to $7.00 per Rite Aid share, down from $9.The deal would have helped Walgreens widen its U.S. footprint and negotiate for lower drug costs. (Reporting by Sruthi Ramakrishnan in Bengaluru; Editing by Shounak Dasgupta)",reuters.com,http://in.reuters.com/finance/deals,http://in.reuters.com/article/rite-aid-ma-walgreens-boots-idINL3N1J64ON,2017-06-09T14:48:00.000+03:00,4352.0,0.0,2.0,,-1.0
1672,476ed99eb49bd466258784067ac62dd7815e05ab,"UPDATE 1-Hansteen to sell German, Dutch industrial properties for $1.4 bln","Company News - Mon Mar 20, 2017 - 5:01am EDT UPDATE 1-Hansteen to sell German, Dutch industrial properties for $1.4 bln (Adds details, background, share movement) March 20 Britain's Hansteen Holdings has agreed to sell its German and Dutch industrial property portfolios for 1.28 billion euros ($1.38 billion) to a venture between Blackstone Group LP and M7 Real Estate. The price represents a premium of about 6 percent, or roughly 76 million euros, to the assets' valuations at the end of 2016, Hansteen said in a statement on Monday. Hansteen's shares rose more than 6 percent, before paring gains to trade up 3 percent at 125.55 pence at 0850 GMT. They were the top gainers on London's midcap index. ""This is a compelling opportunity to crystallise both the revaluation gains from these German and Dutch assets achieved by our active asset management and the gains from foreign exchange movements,"" Hansteen joint chief executives Morgan Jones and Ian Watson said. Last year, the industrial market outperformed all other European real estate sectors, including offices and retail, data from property consultant CBRE showed, as the sector benefited from higher demand for warehouses from retailers expanding their online operations. Over the fourth quarter, European commercial real estate deals reached a record high of 86.8 billion euros, boosted largely by a buoyant Germany market and growth in the Netherlands, according to the data. Hansteen, a UK real estate investment trust, said that the sale was expected to complete before the end of June and that it was advised by property consultant JLL. The sale leaves Hansteen with its UK business, where the market has seen some turbulence after Britain voted to leave the European Union. However, Hansteen said it had not noticed any significant effect on demand for industrial space following the June 23 vote. 
""Across the UK, we are experiencing pockets of rental growth and shorter incentives being offered to tenants as demand intensifies,"" the company said. ($1 = 0.9288 euros) (Reporting by Esha Vaish in Bengaluru; Editing by Jason Neely and Alexander Smith) Next In Company News",reuters.com,http://feeds.reuters.com/reuters/companyNews,http://www.reuters.com/article/hansteen-divestiture-idUSL5N1GX0ZS,2017-03-20T16:01:00.000+02:00,1672.0,0.0,2.0,,-1.0
5052,370eed1e30c0117362432816475315a18ab67463,Italy could consider taking small stake in Alitalia - minister,"July 19, 2017 / 1:38 PM / 8 minutes ago Italy could consider taking small stake in Alitalia - minister Reuters Staff 1 Min Read People walk in the Alitalia departure hall during a strike by Italy's national airline Alitalia workers at Fiumicino international airport in Rome, Italy July 24, 2015. Max Rossi - RTX1LMNG ROME (Reuters) - Italy could consider taking a small stake in struggling airline Alitalia, Transport Minister Graziano Delrio said on Wednesday. ""We are against nationalising (the airline) but the state taking a small stake could be a solution,"" Delrio told a parliamentary commission. He added that the special administrators appointed to run Alitalia after it filed for bankruptcy could stay in their roles for longer than originally planned. Reporting by Alberto Sisto, writing by Isla Binnie 0 : 0",reuters.com,http://feeds.reuters.com/Reuters/UKBusinessNews?format=xml,http://uk.reuters.com/article/uk-italy-alitalia-idUKKBN1A41GI,2017-07-19T16:38:00.000+03:00,5052.0,0.0,2.0,,-1.0


[9819.0, 850.0, 3242.0, 2730.0, 4043.0, 5950.0, 4352.0, 1672.0, 5052.0]

In [None]:
# testing data: manually labeled articles of current round
testing_data = df.loc[(df['Round'] == m)]
len(testing_data)

In [None]:
# classical model fitting:

# all labeled as samples
#training_data = pd.concat([set_0, set_1, set_2])

#m += 1
#m

# classical model fitting:
#all_labeled_data = df.loc[(df['Round'] <= m)].reset_index()
#recall_scores, precision_scores, f1_scores = MultinomialNaiveBayes.make_mnb(all_labeled_data)
#recall_score = sum(recall_scores)/len(recall_scores)
#print('recall: {}'.format(recall_score))
#precision_score = sum(precision_scores)/len(precision_scores)
#print('precision: {}'.format(precision_score))
#f1_score = sum(f1_scores)/len(f1_scores)
#print('f1 score: {}'.format(f1_score))

# stratified sampled:
#training_data = pd.concat([set_0[:strat], set_1[:strat], set_2[:strat]])

#len(training_data)

We now check (and correct if necessary) the next auto-labeled articles.

## Multinomial Naive Bayes Classification: ##

In [None]:
# use sklearn's CountVectorizer
# cv = False
cv = True

# call script with manually labeled and manually unlabeled samples
%time classes, class_count, class_probs = MNBInteractive.estimate_mnb(training_data, testing_data, cv)
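`MNBInteractive.estimate_mnb` is defined in a separate script that is not shown here. As a rough idea of what a multinomial Naive Bayes estimate involves, here is a minimal NumPy sketch with Laplace smoothing. The function names, the toy count matrix and the smoothing value `alpha=1.0` are assumptions and need not match the script.

```python
import numpy as np


def fit_mnb(X, y, alpha=1.0):
    """Fit multinomial Naive Bayes on a count matrix X (n_docs x n_words)."""
    classes = np.unique(y)
    # log prior: relative frequency of each class
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # smoothed word counts per class, normalised to word probabilities
    counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood


def predict_proba(X, log_prior, log_likelihood):
    """Return estimated class probabilities, one row per document."""
    joint = X @ log_likelihood.T + log_prior
    joint -= joint.max(axis=1, keepdims=True)  # stabilise before exponentiating
    probs = np.exp(joint)
    return probs / probs.sum(axis=1, keepdims=True)


# toy bag-of-words counts over a 4-word vocabulary (invented)
X_train = np.array([[3, 0, 1, 0],
                    [2, 1, 0, 0],
                    [0, 3, 0, 2],
                    [0, 2, 1, 3]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[2, 0, 0, 0],
                   [0, 1, 0, 2]])

classes, log_prior, log_likelihood = fit_mnb(X_train, y_train)
class_probs = predict_proba(X_test, log_prior, log_likelihood)
print(classes[class_probs.argmax(axis=1)])  # [0 1]
```

Each row of `class_probs` sums to 1, so its maximum can be read as the confidence of the estimate, which is what the following cells work with.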

In [None]:
m

In [None]:
# annotate highest estimated probability for every instance
maxima = [np.amax(row) for row in class_probs]

In [None]:
#maxima

In [None]:
# save the maxima of class_probs (disabled)
#with open('../obj/'+ 'array_class_probs_round_{}_stratified'.format(m) + '.pkl', 'wb') as f:
#    pickle.dump(maxima, f, pickle.HIGHEST_PROTOCOL)

In [None]:
# sort list in descending order
maxima.sort(reverse=True)

# convert list to array
probas = np.asarray(maxima)

n_bins = 50

fig, ax = plt.subplots(figsize=(8, 4))

# plot the cumulative histogram
n, bins, patches = ax.hist(probas, n_bins, density=True, histtype='step',
                           cumulative=True, facecolor='darkred')

ax.grid(True)
#ax.set_title('Cumulative distribution of highest estimated probability')
ax.set_xlabel('Highest estimated probability')
ax.set_ylabel('Fraction of articles with at most this highest estimated probability')
#plt.axis([0.5, 1, 0, 0.02])
#ax.set_xbound(lower=0.5, upper=0.99)
plt.show()

In [None]:
# save via the figure object: plt.savefig() after plt.show() would write an empty canvas
fig.savefig('../visualization/proba_after_round_{}_stratified.png'.format(m))

We annotate each article's estimated class with its probability in columns 'Estimated' and 'Probability':

In [None]:
# series of indices of recently estimated articles 
indices_estimated = df.loc[df['Round'] == m, 'Index'].tolist()

n = 0    
for row in class_probs:
    for i in range(0, len(classes)):
        index = indices_estimated[n]
        # save estimated label
        if np.amax(row) == row[i]:
            df.loc[index, 'Estimated'] = classes[i]
            # annotate probability
            df.loc[index, 'Probability'] = row[i]
    n += 1
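The same annotation can be computed without the nested loop. The sketch below uses invented stand-ins for `classes` and `class_probs` and only shows the array step; writing the results back into `df` works as in the cell above.

```python
import numpy as np

# toy stand-ins for the values returned by the classifier (invented)
classes = np.array([0, 1, 2])
class_probs = np.array([[0.70, 0.20, 0.10],
                        [0.05, 0.15, 0.80],
                        [0.30, 0.60, 0.10]])

best = class_probs.argmax(axis=1)      # column of the highest probability per row
estimated = classes[best]              # estimated label per article
probability = class_probs.max(axis=1)  # the corresponding probability

print(estimated.tolist(), probability.tolist())  # [0, 2, 1] [0.7, 0.8, 0.6]
```

One difference to keep in mind: if two classes tie for the highest probability, `argmax` picks the first one, whereas the loop above ends up with the last one.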

In [10]:
m = 16

In [11]:
print('###############')
zero_0 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 0) & (df['Label'] == 0)])
zero_0
zero_1 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 0) & (df['Label'] == 1)])
zero_1
zero_2 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 0) & (df['Label'] == 2)])
zero_2
print('###############')
one_0 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 1) & (df['Label'] == 0)])
one_0
one_1 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 1) & (df['Label'] == 1)])
one_1
one_2 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 1) & (df['Label'] == 2)])
one_2
print('###############')

two_0 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 2) & (df['Label'] == 0)])
two_0
two_1 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 2) & (df['Label'] == 1)])
two_1
two_2 = len(df.loc[(df['Round'] == m) & (df['Estimated'] == 2) & (df['Label'] == 2)])
two_2
print('###############')

total = zero_0 + zero_1 + zero_2 + one_0 + one_1 + one_2 + two_0 + two_1 + two_2

tp_0 = zero_0
tp_0
tn_0 = one_1 + one_2 + two_1 + two_2
tn_0
fp_0 = zero_1 + zero_2
fp_0
fn_0 = one_0 + two_0
fn_0
print('###############')

tp_1 = one_1
tp_1
tn_1 = zero_0 + zero_2 + two_0 + two_2
tn_1
fp_1 = one_0 + one_2
fp_1
fn_1 = zero_1 + two_1
fn_1
print('###############')

tp_2 = two_2
tp_2
tn_2 = zero_0 + zero_1 + one_0 + one_1
tn_2
fp_2 = two_0 + two_1
fp_2
fn_2 = zero_2 + one_2
fn_2
print('###############')

prec_0 = tp_0 / (tp_0 + fp_0) * 100
prec_0
rec_0 = tp_0 / (tp_0 + fn_0) * 100
rec_0
acc_0 = (tp_0 + tn_0) / total * 100
acc_0
print('###############')

prec_1 = tp_1 / (tp_1 + fp_1) * 100
prec_1
rec_1 = tp_1  / (tp_1 + fn_1) * 100
rec_1
acc_1 = (tp_1 + tn_1) / total * 100
acc_1
print('###############')

prec_2 = tp_2 / (tp_2 + fp_2) * 100
prec_2
rec_2 = tp_2 / (tp_2 + fn_2) * 100
rec_2
acc_2 = (tp_2 + tn_2) / total * 100
acc_2
print('###############')

(prec_1 + prec_2 + prec_0) / 3
(rec_1 + rec_2 + rec_0) / 3
(acc_1 + acc_2 + acc_0) / 3

###############


5

0

1

###############


2

1

0

###############


1

0

0

###############


5

1

1

3

###############


1

7

2

0

###############


0

8

1

1

###############


83.33333333333334

62.5

60.0

###############


33.33333333333333

100.0

80.0

###############


0.0

0.0

80.0

###############


38.88888888888889

54.166666666666664

73.33333333333333
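
As a cross-check, the macro-averaged precision and recall printed above can be reproduced with scikit-learn. The sketch below is not part of the original workflow. Instead of reading from `df`, it rebuilds the nine counts of this round as a small hypothetical list of (estimated, label) pairs; the names `pairs`, `y_est` and `y_true` are assumptions made for illustration.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical stand-in for the rows of this round: (Estimated, Label) pairs
# that reproduce the counts above (zero_0 = 5, zero_2 = 1, one_0 = 2,
# one_1 = 1, two_0 = 1, all other cells 0).
pairs = [(0, 0)] * 5 + [(0, 2)] + [(1, 0)] * 2 + [(1, 1)] + [(2, 0)]
y_est = [estimated for estimated, _ in pairs]
y_true = [label for _, label in pairs]

# rows: manual label, columns: estimated label
cm = confusion_matrix(y_true, y_est, labels=[0, 1, 2])

# macro average = unweighted mean over the three classes, in percent
macro_precision = precision_score(y_true, y_est, labels=[0, 1, 2], average='macro') * 100
macro_recall = recall_score(y_true, y_est, labels=[0, 1, 2], average='macro') * 100

print(cm)
print(macro_precision, macro_recall)
```

With the real data, `y_true` and `y_est` would presumably be the `Label` and `Estimated` columns of the rows with `Round == m`. The per-class accuracy in the cell above is a one-vs-rest accuracy, so its mean (73.3) differs from the overall share of correct estimates in this round (6 of 10).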

In [None]:
# save round
df.to_csv('../data/interactive_labeling_round_{}.csv'.format(m),
      sep='|',
      mode='w',
      encoding='utf-8',
      quoting=csv.QUOTE_NONNUMERIC,
      quotechar='\'')

In [None]:
#df.loc[:10]

## Manual Labeling: ##

Find new threshold for labeling:

In [None]:
threshold = 0.9997

n = 0
for maximum in maxima:
    if maximum < threshold:
        n += 1
n
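
The same count can be obtained without an explicit loop. This is a minimal sketch, assuming `maxima` holds one maximum class probability per article; `maxima_example` is a made-up stand-in, not data from this notebook.

```python
import numpy as np

# Hypothetical stand-in for `maxima` (one maximum class probability per article)
maxima_example = np.array([0.9999, 0.85, 0.9998, 0.5, 1.0])
threshold = 0.9997

# the boolean mask marks articles below the threshold; summing it counts them
n_below = int((maxima_example < threshold).sum())
print(n_below)
```

Trying a few candidate thresholds this way shows how many articles each choice would send to manual labeling.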

In [None]:
print('Number of articles with estimated probability < {}: {}'.format(threshold, len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold)])))

Check the articles with an estimated probability below the threshold:

In [None]:
# pick unlabeled articles with P < threshold:
label_next = df.loc[(df['Label'] == -1) & (df['Probability'] < threshold), 'Index'].tolist()

In [None]:
len(label_next)

In [None]:
print('Estimated labels to be checked: class 0: {}, class 1: {}, class 2: {}'
      .format(len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 0.0)]), 
              len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 1.0)]),
              len(df.loc[(df['Label'] == -1) & (df['Probability'] < threshold) & (df['Estimated'] == 2.0)])))

In [None]:
# increment round number
m += 1
print('This round number: {}'.format(m))

PLEASE READ THE FOLLOWING ARTICLES AND ENTER THE CORRESPONDING LABELS:

In [None]:
for index in label_next:
    show_next(index)

In [None]:
#df.loc[(df['Round'] == m) & (df['Index'].isin(label_next)), 'Round'] = m

In [None]:
print('Number of manual labels in round no. {}:'.format(m))
print('0:{}, 1:{}, 2:{}'.format(len(df.loc[(df['Label'] == 0) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 1) & (df['Round'] == m)]), len(df.loc[(df['Label'] == 2) & (df['Round'] == m)])))

print('Number of articles to be corrected in this round: {}'.format(len(df.loc[(df['Label'] != -1) & (df['Estimated'] != -1) & (df['Round'] == m) & (df['Label'] != df['Estimated'])])))

In [None]:
# save intermediate status
df.to_csv('../data/interactive_labeling_round_{}_temp.csv'.format(m),
      sep='|',
      mode='w',
      encoding='utf-8',
      quoting=csv.QUOTE_NONNUMERIC,
      quotechar='\'')

## Resubstitution error: Multinomial Naive Bayes ##

In [None]:
X_train_test = df.loc[df['Label'] != -1, 'Title'] + ' ' + df.loc[df['Label'] != -1, 'Text']
y_train_test = df.loc[df['Label'] != -1, 'Label']

In [None]:
# discard old indices
y_train_test = y_train_test.reset_index(drop=True)
X_train_test = X_train_test.reset_index(drop=True)

In [None]:
# use my own BagOfWords python implementation
stemming = True
rel_freq = True
extracted_words = BagOfWords.extract_all_words(X_train_test)
vocab = BagOfWords.make_vocab(extracted_words)

In [None]:
# fit the training data and return the matrix
training_data = BagOfWords.make_matrix(extracted_words, vocab, rel_freq, stemming)
testing_data = training_data

In [None]:
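# Naive Bayes: fit on the labeled data and predict the same data (resubstitution)
classifier = MultinomialNB(alpha=1.0e-10, fit_prior=False, class_prior=None)
classifier.fit(training_data, y_train_test)
predictions = classifier.predict(testing_data)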

In [None]:
# optional, only needed for the resubstitution error

n = 0
for i in range(len(y_train_test)):
    if y_train_test[i] != predictions[i]:
        n += 1
        print('error no.{}'.format(n))
        print('prediction at index {} is: {}, but actual is: {}'.format(i, predictions[i], y_train_test[i]))
        print(X_train_test[i])
        print(y_train_test[i])
        print()
if n==0:
    print('no resubstitution error :-)')
else:
    print('number of wrongly estimated articles: {}'.format(n))
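
The resubstitution check can also be expressed as a single error rate. The following is a self-contained sketch on a made-up mini corpus: `CountVectorizer` stands in for the custom `BagOfWords` class, and the texts, the labels and the names `resub_predictions`, `wrong` and `resub_error` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up texts and arbitrary class labels (not taken from the data set)
texts = ['company acquires rival company',
         'merger of two banks announced',
         'quarterly profits rise sharply',
         'new chief executive appointed at retailer']
labels = np.array([1, 1, 0, 0])

# bag-of-words count matrix (stand-in for BagOfWords.make_matrix)
X = CountVectorizer().fit_transform(texts)

# same classifier settings as in the cell above
clf = MultinomialNB(alpha=1.0e-10, fit_prior=False)
clf.fit(X, labels)

# resubstitution: predict the very samples the model was trained on
resub_predictions = clf.predict(X)
wrong = np.flatnonzero(resub_predictions != labels)
resub_error = len(wrong) / len(labels)

print('misclassified indices:', wrong.tolist())
print('resubstitution error: {:.2%}'.format(resub_error))
```

Because the model is evaluated on its own training data, this value is an optimistic estimate. It is mainly useful for spotting articles whose manual label contradicts the rest of the labeled set.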

In [None]:
print('End of this round (no. {}):'.format(m))
print('Number of manually labeled articles: {}'.format(len(df.loc[df['Label'] != -1])))
print('Number of articles not yet labeled manually: {}'.format(len(df.loc[df['Label'] == -1])))

In [None]:
# save this round to csv
df.to_csv('../data/interactive_labeling_round_{}_neu.csv'.format(m),
      sep='|',
      mode='w',
      encoding='utf-8',
      quoting=csv.QUOTE_NONNUMERIC,
      quotechar='\'')

NOW PLEASE CONTINUE WITH PART II.
REPEAT UNTIL ALL SAMPLES ARE LABELED.