| Name |
|---|
| stanford-ner-2018-02-27 |
| thesis |
| .gitignore |
| BagOfWords.py |
| CosineSimilarity.py |
| DecisionTree.py |
| FilterKeywords.py |
| JSONHandler.py |
| NER.py |
| NaiveBayes.py |
| NaiveBayes_Interactive.py |
| README.md |
| Requester.py |
| SVM.py |
| classification_labelled_corrected.csv |
Predictor for Company Mergers
(Bachelor's thesis, Anne)
This project contains Python classes for text mining, machine learning models, …
The CSV file classification_labelled_corrected.csv contains 1497 labeled news articles from Reuters.com and serves as the data set for the machine learning models.
Best F1 score results:
- Support Vector Machines Classifier (SVM):
  F1 score: 0.894
  Best parameters set found on development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}
- Naive Bayes Classifier:
  F1 score: 0.832 (average)
  Parameters: SelectPercentile(25), own Bag of Words implementation, 10-fold cross validation
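The parameter names in the SVM result follow scikit-learn's `step__parameter` convention, which suggests a Pipeline with a step named `perc` (feature selection) and a step named `SVC` (the classifier). The following is a minimal sketch of such a pipeline with the reported best parameters applied. It is not the code from SVM.py: the `vect` step, the `chi2` score function, and the toy texts and labels are assumptions made only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC


def build_svm_pipeline():
    """Build a text classification pipeline using the reported best parameters."""
    pipeline = Pipeline([
        ("vect", CountVectorizer()),       # assumption: plain word counts as features
        ("perc", SelectPercentile(chi2)),  # assumption: chi2 as the score function
        ("SVC", SVC()),
    ])
    # Best parameters as listed in the README.
    pipeline.set_params(**{
        "SVC__C": 0.1,
        "SVC__gamma": 0.01,
        "SVC__kernel": "linear",
        "perc__percentile": 50,
    })
    return pipeline


# Toy data, invented for this sketch (1 = merger-related, 0 = not merger-related).
texts = [
    "company a agrees to merge with company b",
    "firm announces merger with rival firm",
    "central bank keeps interest rates unchanged",
    "oil prices fall on weak demand",
]
labels = [1, 1, 0, 0]

model = build_svm_pipeline().fit(texts, labels)
predictions = model.predict(texts)
print(predictions)
```

With the real data set, the toy lists would be replaced by the article texts and labels read from classification_labelled_corrected.csv.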
The complete documentation can be found in the LaTeX document in the thesis folder.
Installation under Windows
$ pip install xy
Here xy is a placeholder: run the command once for each package listed under Requirements.
Requirements
pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2
Usage
The scripts can be called separately. Before you call Requester.py, you need to enter a valid personal key for webhose.io. Before you run NER.py, you need to change the path assigned to the JAVAHOME environment variable in the find_companies method so that it points to your local Java installation.
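As an illustration of the JAVAHOME step, here is a minimal sketch of how the variable can be set from Python before a Java-based tagger is started. The helper name `configure_java` and the example path are hypothetical. The actual assignment lives in the find_companies method of NER.py and may look different.

```python
import os


def configure_java(java_path):
    """Set JAVAHOME for the current process and return the stored value.

    NLTK looks up this variable when it needs to locate the Java executable.
    """
    os.environ["JAVAHOME"] = java_path
    return os.environ["JAVAHOME"]


# Hypothetical Windows path; replace it with the location of java.exe on your machine.
java_home = configure_java(r"C:\Program Files\Java\jre\bin\java.exe")
print(java_home)
```

Setting the variable inside the script affects only that Python process, so the system-wide environment stays unchanged.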
Author: Anne Lorenz / Datavard AG
Project Status: work in progress