my python classes for text mining, machine learning models, …
Anne Lorenz cbfbdffdb7 corrected calculation of precision 5 years ago
stanford-ner-2018-02-27 Update 5 years ago
thesis Update 5 years ago
.gitignore deleted .gitignore 5 years ago
BagOfWords.py Update 5 years ago
CosineSimilarity.py added requirements and some things 5 years ago
DecisionTree.py removed csvHandler.py 5 years ago
FilterKeywords.py dict -> defaultdict 5 years ago
JSONHandler.py corrected calculation of precision 5 years ago
NER.py Update 5 years ago
NaiveBayes.py corrected calculation of precision 5 years ago
NaiveBayes_Interactive.py corrected calculation of precision 5 years ago
README.md corrected calculation of precision 5 years ago
Requester.py changes 5 years ago
SVM.py removed csvHandler.py 5 years ago
classification_labelled_corrected.csv added new files 5 years ago

README.md

Predictor for Company Mergers

(Bachelor's thesis, Anne)

This project contains Python classes for text mining and machine learning models that recognize company mergers in news articles. The CSV file classification_labelled_corrected.csv contains 1497 labeled news articles from Reuters.com and is used as the data set for the machine learning models.

Best F1 score results:

  • Support Vector Machines Classifier (SVM):
    F1 score: 0.894
    Best parameter set found on the development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}

  • Naive Bayes Classifier:
    F1 score: 0.841 (average)
    Parameters: SelectPercentile(100), own Bag of Words implementation, 10-fold cross-validation
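
For readers unfamiliar with the metric, here is a minimal sketch of how an F1 score averaged over 10 folds could be computed. It is not the code used in this repository: the function names, the contiguous fold split without shuffling or stratification, and the `predict_fold` callback are all assumptions made for illustration.

```python
from statistics import mean


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def average_f1(labels, predict_fold, k=10):
    """Average the F1 score over k folds.

    predict_fold(train_idx, test_idx) is a hypothetical callback that trains a
    model on train_idx and returns the predicted labels for test_idx.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        y_true = [labels[i] for i in test_idx]
        y_pred = predict_fold(train_idx, test_idx)
        scores.append(precision_recall_f1(y_true, y_pred)[2])
    return mean(scores)
```

With 1497 articles and k=10, this split gives seven folds of 150 articles and three folds of 149.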

The complete documentation can be found in the LaTeX document in the thesis folder.

Installation under Windows

$ pip install xy

Here xy is a placeholder for a package name; see the pinned versions under Requirements.

Requirements

pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2

Usage

The scripts can be called separately. Before you call Requester.py, you need to enter a valid personal key for webhose.io. To run NER.py, you need to change the path assigned to the JAVAHOME environment variable in the find_companies method.
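
As an alternative to editing the scripts by hand, the two machine-specific settings could be collected in one small helper. This is only a sketch under assumptions: `configure_environment` and the variable name `WEBHOSE_API_KEY` are hypothetical and do not exist in this repository, and the scripts would still have to be adapted to use the returned key.

```python
import os


def configure_environment(java_home, api_key_var="WEBHOSE_API_KEY"):
    """Set JAVAHOME and return the webhose.io key read from the environment.

    java_home: path to the local Java installation (machine-specific).
    api_key_var: hypothetical name of the variable holding the personal key.
    """
    # JAVAHOME is the variable name given in the Usage notes for NER.py.
    os.environ["JAVAHOME"] = java_home
    api_key = os.environ.get(api_key_var)
    if not api_key:
        raise RuntimeError(f"Set {api_key_var} to your personal webhose.io key")
    return api_key
```

Reading the key from the environment keeps the personal key out of the source files, which helps if the repository is shared.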


Author: Anne Lorenz / Datavard AG

Project Status: work in progress