Anne's Bachelor Thesis
Status: October 2018 (in progress)
My Python classes for text mining, machine learning models, … Each script can be run on its own.
Best F1 score results were:
SVM
F1 score: 0.8944166649330559
Best parameters set found on development set: {'SVC__C': 0.1, 'SVC__gamma': 0.01, 'SVC__kernel': 'linear', 'perc__percentile': 50}
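The parameter names above ('SVC__…', 'perc__…') suggest a scikit-learn pipeline with a step called 'perc' and a step called 'SVC', tuned by grid search. The following is only a minimal sketch of such a setup, not the code in 'SVM.py': the 'vect' step, the chi2 scoring function, the toy corpus, the grid values and cv=2 are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus (hypothetical), 1 = relevant article, 0 = not relevant.
texts = [
    "company announces merger with rival",
    "merger talks between two companies continue",
    "company shares rise after merger news",
    "board approves merger of the company",
    "local team wins the football match",
    "football season starts with a big match",
    "fans celebrate after the team wins",
    "coach praises team after football match",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step names 'perc' and 'SVC' mirror the reported parameter keys;
# the vectorizer step name 'vect' is an assumption.
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("perc", SelectPercentile(chi2)),
    ("SVC", SVC()),
])

# Illustrative grid; the grid actually searched is not documented here.
param_grid = {
    "perc__percentile": [25, 50, 100],
    "SVC__kernel": ["linear", "rbf"],
    "SVC__C": [0.1, 1, 10],
    "SVC__gamma": [0.01, 0.1],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

print("best parameters:", search.best_params_)
print("best F1 score:", search.best_score_)
```

Note that scikit-learn's SVC only uses gamma for the 'rbf', 'poly' and 'sigmoid' kernels, so with 'SVC__kernel': 'linear' the reported gamma value has no effect on the winning model.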
Naive Bayes
Parameters: SelectPercentile(25), own BOW implementation, 10-fold cross-validation
F1 score: min = 0.7586206896551724, max = 0.8846153846153846, average = 0.8324014738144634
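To make the min / max / average figures easier to read: in k-fold cross-validation one F1 score is computed per fold and the scores are then summarised. A small sketch of that arithmetic follows. The per-fold counts are made up for illustration and are not the thesis results, and the helper names are hypothetical.

```python
from statistics import mean


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the score is undefined."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def summarize(scores):
    """Collapse per-fold scores into the min / max / average reported above."""
    return {"min": min(scores), "max": max(scores), "average": mean(scores)}


# Hypothetical (true positives, false positives, false negatives) per fold.
fold_counts = [(8, 2, 2), (9, 1, 1), (7, 3, 1)]
fold_scores = [f1_score(tp, fp, fn) for tp, fp, fn in fold_counts]

print(summarize(fold_scores))
```

With 10 folds the list simply has ten entries. A wide gap between min and max, as in the numbers above, usually means the score depends noticeably on which articles end up in the test fold.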
The complete documentation can be found in the LaTeX document in the thesis folder.
The CSV file 'classification_labelled_corrected.csv' contains 1497 labelled news articles from Reuters.com and is used for the machine learning models.
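A quick sanity check of the data set could look like the sketch below. It counts how often each label occurs. The column name "label", the comma delimiter and the UTF-8 encoding are assumptions, since the file layout is not described here; adjust them to the actual header.

```python
import csv
from collections import Counter


def label_counts(path, label_column="label", delimiter=","):
    """Count how often each value of `label_column` occurs in a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        if label_column not in (reader.fieldnames or []):
            raise KeyError(
                f"column {label_column!r} not found; header is {reader.fieldnames}"
            )
        return Counter(row[label_column] for row in reader)
```

Calling label_counts('classification_labelled_corrected.csv') returns a Counter; if the file holds one article per row, the counts should add up to 1497.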
Note: Enter a valid Webhose personal key before you call 'Requester.py'. Also change the JAVAHOME path set in the 'NER.find_companies' method so that it points to your own Java installation.
Example:
    # set paths (raw string, so the backslashes are not treated as escapes)
    import os
    java_path = r"C:\Program Files (x86)\Java\jre1.8.0_181"
    os.environ['JAVAHOME'] = java_path
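If you would rather not edit the source on every machine, one option is to read both settings from environment variables. This is only a sketch and not how the scripts currently work: the helper names and the variable name WEBHOSE_TOKEN are hypothetical, and JAVA_HOME is used as a fallback because many Java installers set it.

```python
import os


def resolve_java_home(env=os.environ):
    """Return the Java directory from JAVAHOME, falling back to JAVA_HOME."""
    for variable in ("JAVAHOME", "JAVA_HOME"):
        value = env.get(variable)
        if value:
            return value
    raise RuntimeError("Set JAVAHOME (or JAVA_HOME) to your Java installation directory.")


def resolve_webhose_key(env=os.environ):
    """Return the Webhose key from WEBHOSE_TOKEN (hypothetical variable name)."""
    key = env.get("WEBHOSE_TOKEN")
    if not key:
        raise RuntimeError("Set WEBHOSE_TOKEN to your Webhose personal key.")
    return key
```

In 'NER.find_companies' the hard-coded path could then become os.environ['JAVAHOME'] = resolve_java_home(), and 'Requester.py' could pass resolve_webhose_key() wherever the key is entered today.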
Requirements
pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2
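Because the versions are pinned, it can help to check what is actually installed before running the scripts. The sketch below compares the pins above with the local environment. The helper names are hypothetical, and importlib.metadata needs Python 3.8 or newer, which may be newer than the interpreter these pins were originally used with.

```python
from importlib.metadata import PackageNotFoundError, version

PINS = """\
pandas==0.20.1
nltk==3.2.5
webhoseio==0.5
numpy==1.14.0
graphviz==0.9
scikit_learn==0.19.2
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, _, wanted = line.partition("==")
            pins[name] = wanted
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def check_pins(pins, lookup=installed_version):
    """Return {name: (wanted, found)} for every package that does not match."""
    mismatches = {}
    for name, wanted in pins.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(parse_pins(PINS)).items():
        print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

An empty result means every pinned package is installed in exactly the listed version.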
Installation under Windows
pip install XY
Installation under Ubuntu
apt-get install XX