# Predictor for Company Mergers
(Bachelor's thesis, Anne)

This project contains Python classes for text mining and machine learning models that recognize company mergers in news articles.
The CSV file *classification_labelled_corrected.csv* contains 1497 labeled news articles from *Reuters.com* and serves as the data set for the machine learning models.

**Best F1 score results**:

* **Support Vector Machines Classifier (SVM):**
  * F1 score: 0.894
  * Best parameters set found on development set: `{'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}`

* **Naive Bayes Classifier:**
  * F1 score: 0.841 (average)
  * Parameters: SelectPercentile(100), own Bag of Words implementation, 10-fold cross-validation

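
The parameter names reported for the SVM (`SVC__C`, `perc__percentile`) follow scikit-learn's `step__parameter` convention, which suggests a `Pipeline` tuned with a grid search. The following is only a minimal sketch under that assumption, not the thesis code: the toy texts, the `vect` step, the `chi2` score function, the grid values and `cv=2` are all placeholders chosen for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for the labelled articles (1 = merger, 0 = no merger).
texts = [
    "Company A agrees to merge with Company B",
    "Firm X completes takeover and merger of Firm Y",
    "Bank announces merger deal with rival lender",
    "Retailer to merge with competitor in stock deal",
    "Central bank keeps interest rates unchanged",
    "Oil prices fall as supply rises",
    "Carmaker recalls vehicles over brake fault",
    "Quarterly profit beats analyst forecasts",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("vect", CountVectorizer()),       # assumption: hypothetical bag-of-words step
    ("perc", SelectPercentile(chi2)),  # step name inferred from 'perc__percentile'; chi2 is an assumption
    ("SVC", SVC()),                    # step name inferred from 'SVC__C'
])

# Assumption: illustrative grid only; the grid actually searched is not given here.
param_grid = {
    "perc__percentile": [50, 100],
    "SVC__kernel": ["linear", "rbf"],
    "SVC__C": [0.1, 1.0],
    "SVC__gamma": [0.01, 0.1],
}

# Score every parameter combination by F1 using 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Note that scikit-learn's `SVC` ignores `gamma` when the kernel is `linear`, so in a result like the one above only `C` and the percentile affect the fitted model.
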
The complete documentation can be found in the LaTeX document in the *thesis* folder.

## Installation under Windows

```bash
$ pip install xy
```

### Requirements

* pandas==0.20.1
* nltk==3.2.5
* webhoseio==0.5
* numpy==1.14.0
* graphviz==0.9
* scikit_learn==0.19.2

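
To see whether an environment matches the pinned versions above, a small check along these lines may help. It is a sketch, not part of the project: the helper names `installed_version` and `check_pins` are hypothetical, and `importlib.metadata` needs Python 3.8 or newer, which may be newer than the interpreter these pins were originally chosen for.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the requirements list.
PINS = {
    "pandas": "0.20.1",
    "nltk": "3.2.5",
    "webhoseio": "0.5",
    "numpy": "1.14.0",
    "graphviz": "0.9",
    "scikit_learn": "0.19.2",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {package: (expected, found)} for every package that deviates from its pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = lookup(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found or 'not installed'}")
```

The `lookup` argument exists only so the comparison can be exercised without touching the real environment.
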
## Usage
The scripts can be called separately.
Before calling *Requester.py*, enter a valid personal key for *webhose.io*.
To run *NER.py*, change the path assigned to the *JAVAHOME* environment variable in the *find_companies* method.

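
The two setup steps above could also be handled without editing the scripts, for example by reading both values at start-up. This is a hedged sketch only: the environment variable `WEBHOSE_API_KEY` and the helpers `load_api_key` and `set_java_home` are hypothetical and do not exist in *Requester.py* or *NER.py*; the sketch assumes that *JAVAHOME* is set through `os.environ`.

```python
import os


def load_api_key(env_var="WEBHOSE_API_KEY"):
    """Read the personal webhose.io key from an environment variable (hypothetical name)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your personal webhose.io key")
    return key


def set_java_home(java_path):
    """Point the JAVAHOME environment variable at an existing local Java path."""
    if not os.path.exists(java_path):
        raise FileNotFoundError(f"No Java installation found at {java_path}")
    os.environ["JAVAHOME"] = java_path
    return java_path
```
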
---
**Author:** Anne Lorenz / Datavard AG

**Project Status:** work in progress