my python classes for text mining, machine learning models, …
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
Anne Lorenz 7c3353edab update labeling / documentation 4 years ago
data update labeling / documentation 4 years ago
obj update labeling / documentation 4 years ago
src update labeling / documentation 4 years ago
stanford-ner-2018-02-27 Update 5 years ago
visualization update labeling / documentation 4 years ago
.gitignore Initial commit 5 years ago
README.md corrected calculation of precision 5 years ago

README.md

Predictor for Company Mergers

(Bachelorthesis Anne)

This project contains python classes for text mining and machine learning models to recognize company mergers in news articles. The csv file classification_labelled_corrected.csv contains 1497 labeled news articles from Reuters.com and is used for the machine learning models.

Best F1 score results:

  • Support Vector Machines Classifier (SVM):
    F1 score: 0.894
    Best parameters set found on development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}

  • Naive Bayes Classifier:
    F1 score: 0.841 (average)
    Parameters: SelectPercentile(100), own Bag of Words implementation, 10-fold cross validation

The complete documentation can be found in the latex document in the thesis folder.

Installation under Windows

$ pip install xy

Requirements

pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2

Usage

The scripts can be called separately. You need to enter a valid personal key for webhose.io before you call Requester.py. To run NER.py you need to change the path to the JAVAHOME environment variable in find_companies method.


Author: Anne Lorenz / Datavard AG

Project Status: work in progress