Given a classification problem, there is always a labeled data set needed first to apply a machine learning model and make predictions possible. The larger the labeled data set is, the better are generally the predictions. However, to get there, each single data element must first be classified manually. Depending on the type of data, this procedure can be very time-consuming, for example if longer texts have to be read.
In this thesis we want to present an alternative data labeling method that allows to label a larger amount of data in a shorter time.
\section{Goals}
\label{sec:goals}
\jk{Ein Satz welcher das Problem beschreibt, dannach dann runtergebrochen in Teilaufgaben}
We want to compare a conventional method of data labeling with an alternative, incremental method using the following example: The aim is to investigate news articles about recent mergers ('mergers and acquisitions') and to classify them accordingly. With the help of the labeled data set, different classification models will be applied and optimized so that a prediction about future news articles will be possible.
As a source for our initial data set, RSS feeds from established business news agencies such as \textit{Reuters} or \textit{Bloomberg} come into consideration. However, when crawling RSS feeds, it is not possible to retrieve news from a longer period in the past. Since we want to analyze news of the period of 12 months, we obtain the data set from the provider \textit{webhose.io}\footnote{\url{<https://webhose.io/>}}. It offers access to English news articles from sections like \textit{Financial News}, \textit{Finance} and \textit{Business} at affordable fees compared to the news agencies' offers. As we are only interested in reliable sources, we limit our request to the websites of the news agengies \textit{Reuters, Bloomberg, Financial Times, CNN, The Economist} and \textit{The Guardian}.
\jk{Was muss insgesamt gemacht werden, welche Teilprobleme müssen addressiert werden. Alternativen besprechen, Entscheidungen fällen basierend auf Kriterien. Hier kommt Deine Arbeit hin, kein Related work oder Methoden die es schon gibt. Nur falls man es Vergleicht, dann relevant.}
Before we can start with the data processing, we have to identify and select appropriate data. We downloaded news articles of 12 months (year 2017) from the website \textit{webhose.io}.
To retrieve our data, we make the following request\footnote{On \url{https://docs.webhose.io/docs/filters-reference} you can learn more about the possible filter settings of \textit{webhose.io.}}:\\\\
\texttt{
site:(reuters.com OR ft.com OR cnn.com OR economist.com\\
\noindent\hspace*{12mm}%
OR bloomberg.com OR theguardian.com)\\
site\_category:(financial\_news OR finance OR business)\\
\\
timeframe:january2017-december2017}\\
\\
The requested data was downloaded in September 2018 with JSON as file format. Every news article is saved in a single file, in total 1.478.508 files were downloaded (4,69 GiB).
Among others, one JSON file contains the information shown in the following example :\\
"title": "EU antitrust ruling on Microsoft buy of GitHub due by October 19",
"text": "BRUSSELS (Reuters)-EU antitrust regulators will decide by Oct. 19 whether to clear U.S. software giant Microsoft's $7.5 billion dollar acquisition of privately held coding website GitHub. Microsoft, which wants to acquire the firm to reinforce its cloud computing business against rival Amazon, requested European Union approval for the deal last Friday, a filing on the European Commission website showed on Monday. The EU competition enforcer can either give the green light with or without demanding concessions, or it can open a full-scale investigation if it has serious concerns. GitHub, the world's largest code host with more than 28 million developers using its platform, is Microsoft's largest takeover since the company bought LinkedIn for $26 billion in 2016. Microsoft Chief Executive Satya Nadella has tried to assuage users' worries that GitHub might favor Microsoft products over competitors after the deal, saying GitHub would continue to be an open platform that works with all the public clouds. Reporting by Foo Yun Chee; Editing by Edmund Blair",
"language": "english",
"crawled": "2018-09-18T01:52:42.035+03:00"
}
\end{lstlisting}
As \textit{webhose.io} is a secondary source for news articles and only crawls the news feeds itself, it may occur that some RSS feeds are not parsed correctly or a article is tagged with a wrong topic as \textit{site categories}. The downloaded files also contain blog entries, user comments, videos or graphical content and other spam which we have to filter out. We also do not need pages quoting Reuters etc.. Besides this, we are only interested in English news articles.\\
After we have filtered out all the irrelevant data, we receive a data set of \textbf{41.790} news articles that we store in multiple csv files\footnote{All csv files have a total size of 109 MB.}, one for each month.
\subsection{Selecting the Working Data Set}
\label{subsec:data_selection}
We have received a different number of articles from each month. Because we want the items for our initial working data set to be fairly distributed throughout the year, we select 833 articles from each month\footnote{We select 834 from every third month: (8 * 833) + (4 * 834) = 10.000.} to create a csv file containing \textbf{10.000} articles with a total size of 27 MB.
The textual corpus\footnote{We describe the initial data set in detail in Chapter \ref{chap:design}, Section \ref{subsec:data_selection}.} contains of the news articles' headlines and plain texts, if not specified otherwise. For the sake of simplicity we use the unigram model for our analysis.
\subsection{Sources for News Articles}
As illustrated in Table \ref{table:sources}, the main source for news articles in our data set is \textit{Reuters.com}. This is due to the fact that Webhose.io does not have equal access to the desired sources and, above all, the news from Reuters.com has been parsed with the required quality.
\begin{center}
\begin{table}[h]
\centering
\begin{tabular}{|l|r|}
\hline
reuters.com & 94\%\\
theguardian.com & 3\%\\
economist.com & 2\%\\
bloomberg.com & < 1\%\\
cnn.com & < 1\%\\
ft.com & < 1\%\\
\hline
\end{tabular}
\caption{Article sources in the data set}
\label{table:sources}
\end{table}
\end{center}
\al{Ist es ein Problem, dass Reuters die Hauptquelle ist?}
The average length of the news articles examined is 2476 characters\footnote{headlines excluded}. The distribution of the article length in the dataset is shown in Figure \ref{fig:article_length}.
The 10 most common words in the data set are: \textit{'percent', 'fitch', 'billion', 'new', 'business', 'market', 'next', 'million', 'ratings', 'investors'}.
% toDo
\al{Ist nur von den 100 ersten Artikeln im Datensatz, muss noch ausgetauscht werden.}
\al{Hier geht es noch um den alten Datensatz, muss noch ausgetauscht werden.}
'Comcast' is the most frequently used company name in the data set. Figure \ref{fig:company_names} shows that big companies dominate the reporting about mergers. In order to use a fairly distributed data set for model selection, we limit the number of articles used to 3 per company name.
First, we label a slightly smaller data set in a conventional way. The dataset consists of 1497 news articles, which were downloaded via \textit{webhose.io}. The dataset contains news articles from different Reuters' RSS feeds dating from the period of one month \footnote{The timeframe was May 25 - June 25 2018, retrieved on June 25 2018.}. Here, we only filter out articles that contain at least one of the keywords \textit{'merger', 'acquisition', 'take over', 'deal', 'transaction'} or \textit{'buy'} in the heading.
With the following query\footnote{Please read more about the possible filter settings on the website \url{https://docs.webhose.io/docs/filters-reference}} we download the desired data from \textit{webhose.io}:\\\\
Some article texts were difficult to classify even when read carefully.
Here are a few examples of the difficulties that showed up:
\begin{itemize}
\item\textit{'Company A acquires more than 50\% of the shares of company B.'}\\ => How should share sales be handled? Actually, this means a change of ownership, even if it is not a real merger.
\item\textit{'Company X will buy/wants to buy company Y.'}\\=> Will the merger definitely take place? On what circumstances does it depend?
\item\textit{'Last year company X and company Y merged. Now company A wants to invest more in renewable energies.'}\\ => Only an incidental remark deals with a merger that is not taking place right now. The main topic of the article is about something completely different.
\end{itemize}
These difficulties led to the idea of using different labeling classes, which we finally implemented in the interactive labeling method.
For the interactive labeling method, we use the data set of 10.000 articles from a whole year described in Chapter \ref{chap:design}, Section \ref{sec:data_selection}.
%Es ist wahrscheinlich dann man nur Merger mit vielen Artikeln hat => Das könnte man minimieren indem man “stratified” sampling macht => Zuerst NER machen, danach fair über Klassen randomisieren => wähle 10 Artikel von 100 Kategorien aus => 10 Kategorien auswählen => darunter zufällig ein Artikel . Labeln von 1\% aller Artikel
%1) Erste Modelle bauen z.b. Bayes . Auf alle Artikel anwenden => Wahrscheinlichkeit pro Klasse Vektor: (K1, K2, … , K6)
%Klare Fälle: Kx > 80\% und alle anderen Ky < 10\% (mit x in {1-6}, y != x)
%=> Label übernehmen => wie viele Fälle sind eindeutig?
%Behauptung: 10\% aller Artikel sind eindeutig
%Stichprobenartig überprüfen => 10 Artikel random auswählen von jeder Klasse
%Identifikation von äußert unklaren Fällen
%Mehr als eine Klasse hat ähnliche Wahrscheinlichkeit
%(5\%, 5\%, 5\%, …) => (80\%, 80\%, 0\%, 0\%, …)
%z.b. 100 Artikel angucken und manuell label
%=> Wiederhole ich 3-4 mal gehe zu Schritt 1) (Modell bauen)
%=> 95\% aller Fälle sind jetzt klar.
%=> warum gehen die 5\% nicht? Stichprobenartig Artikel anschauen
%Falls das nicht klappt, Modelle oder Preprozessing (z.b. NER) verbessern
Hiermit versichere ich an Eides statt, dass ich die vorliegende Arbeit im Bachelorstudiengang Wirtschaftsinformatik selbstständig verfasst und keine anderen als die angegebenen Hilfsmittel – insbesondere keine im Quellenverzeichnis nicht benannten Internet-Quellen – benutzt habe. Alle Stellen, die wörtlich oder sinngemäß aus Veröffentlichungen entnommen wurden, sind als solche kenntlich gemacht. Ich versichere weiterhin, dass ich die Arbeit vorher nicht in einem anderen Prüfungsverfahren eingereicht habe und die eingereichte schriftliche Fassung der auf dem elektronischen Speichermedium entspricht.