Julian M. Kunkel 2019-09-20 12:08:25 +01:00
commit 29f645f3d4
5 changed files with 280 additions and 0 deletions

crawler/.gitignore vendored Normal file

@@ -0,0 +1,3 @@

crawler/ Normal file

@@ -0,0 +1,124 @@
# Webcrawler and search for some HPC material
## Using the webcrawler
There are three scripts in this directory, which are supposed to be run in sequence:
$ ./fetch-data
$ ./index-data
$ ./ some interesting query
### Step 1: Fetch the data
The first script is a web crawler that downloads "interesting" stuff from some HPC related websites.
### Step 2: Process the data
The second script walks the file hierarchy created by the first script and turns the HTML and XML markup into pure text.
From this data, it builds two archive files: one is a two-column CSV file (path and content),
the other is meant to be used as an input file for an Elasticsearch database.
See the script comment in `index-data` for details.
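As a sketch of the first archive's shape, the CSV can be read back with Python's `csv` module (the paths and contents below are invented for illustration; real archives hold whole crawled pages):

```python
import csv
import io

# A hypothetical two-row archive in the shape index-data writes:
# column 1 is the file path, column 2 is the flattened text content.
sample = 'data/example/intro.html,"Some page text"\r\ndata/example/faq.html,"More page text"\r\n'

rows = list(csv.reader(io.StringIO(sample)))
for path, content in rows:
    print(path, "->", content)
```

Note that real page contents can exceed the `csv` module's default field size limit, which is why `` raises it before reading.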
### Step 3: Query the data
The third script uses Python's fuzzywuzzy module to match a given query to the contents of the archive file produced by the second script.
## Performance of the search
I am not impressed by it.
Sorry, but I really cannot sell this as a success story.
The problems that I see:
* The match percentage depends only on the number of query words that are found in the document.
As such, the result list does not distinguish between a document that uses the query words once and one that uses them over and over again.
* The match percentage does not take into account whether the query words appear close together or not.
* It is completely ignored *where* in the text the query words appear.
A page that lists for example "Python" in its title is scored equal to a page that contains a link with the tool-tip "a small python script to frobnicate foos".
Both will be listed as 100 percent matches to the query "python".
These problems severely limit the usefulness of the query feature, as shown in the examples below.
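A toy example illustrates the first problem. The helper below is a simplified stand-in for fuzzywuzzy's `token_set_ratio` (it works on token *sets*, so repetition and position are invisible to it); both documents are made up:

```python
def token_set_score(document: str, query: str) -> int:
    """Score a document by the fraction of query tokens it contains.
    Like a set-based matcher, this ignores how often and where the
    tokens appear."""
    doc_tokens = set(document.lower().split())
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0
    return 100 * len(query_tokens & doc_tokens) // len(query_tokens)

passing_mention = "APIs for other languages include Python Ruby and Perl"
dedicated_page = "Python tutorial learn Python by writing Python programs"

# Both documents contain the token "python", so both score 100,
# no matter how central the topic is to the page.
print(token_set_score(passing_mention, "python"))  # 100
print(token_set_score(dedicated_page, "python"))   # 100
```

This is exactly the behavior seen in the example queries below: a single passing mention is indistinguishable from a dedicated page.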
### Some example queries
#### Searching for a Python introduction
$ ./ python introduction | head
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
The first document matches because it contains the sentence "C-based 3rd-party netCDF APIs for other languages include Python, Ruby, Perl, Fortran-2003, MATLAB, IDL, and R".
The document is an introduction, all right, but for NetCDF, not Python.
Likewise, the FAQ (second document) mentions several times that NetCDF can be used with Python, and it contains two links that lead to "introduction" documents.
The third link is even more obscure.
Funnily enough, it mentions the NetCDF Python library again (once), but the word "introduction" does not even appear in the rendered HTML document.
It is necessary to load the page's source HTML code to find out that there is a link contained within that page that has the tool-tip
"Short introduction to the OpenStack Swift Storage system",
and a second link whose tool-tip praises it as leading to "a short introduction".
Googling for "python introduction" yields much better results.
#### Trying to solve a batch processing problem
$ ./ batch job not starting | head
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
The first result describes debugging with ARM DDT, not how to troubleshoot problems with batch jobs.
The second result is actually the same page as the first.
The third and seventh results are actually somewhat useful: they are the online versions of `man srun` and `man sbatch`.
The other results are not useful, as they are just further man pages of the other Slurm commands,
which offer little information on troubleshooting jobs that won't start.
At least, the links point the user to the correct software.
Google, with the same query, does not fare any better, as its results are dominated by Microsoft's Dynamics AX software.
#### Trying to run a program on several nodes
$ ./ run program on several nodes in parallel | head
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Match: 100% in URL: data/
Finally a success: the second result provides the required information.
Google provides a wide variety of more or less helpful links, some of which are significantly better geared towards people without a solid education in HPC.
The more specialized the queries were, the better the results of `./` became.
However, the algorithm's insensitivity to the locations and number of the matches frequently allows entirely unrelated results to float to the top.
Google does not suffer from this problem.
Google only defeats itself whenever there is a major non-HPC technology/interpretation/thing that dominates its result list, pushing the useful results out of sight.
The most important improvement would be to take into account whether a match occurs within a link or its tool-tip.
The next would be to weight matches by where they occur within the text (title/introductory paragraphs/body/footnotes).
The third would be to weight whether the matches occur in close proximity to each other.
The fourth would be to consider the number of matches (passing their relative frequency through a log() function or similar).
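The fourth improvement can be sketched in a few lines (the function name, the example documents, and the normalization are invented; the log() damping is just one of several possible choices):

```python
import math

def frequency_weighted_score(document: str, query: str) -> float:
    """Like a plain token match, but damp each query word's
    contribution by the log of its occurrence count, so a single
    passing mention scores lower than repeated use."""
    doc_tokens = document.lower().split()
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    score = 0.0
    for word in query_tokens:
        count = doc_tokens.count(word)
        score += math.log(1 + count)   # 0 if absent, grows slowly with count
    # Normalize so a document mentioning every query word exactly once scores 100.
    return 100.0 * score / (len(query_tokens) * math.log(2))

passing_mention = "languages include python ruby and perl"
dedicated_page = "python tutorial learn python by writing python programs"

print(frequency_weighted_score(passing_mention, "python"))  # 100.0
print(frequency_weighted_score(dedicated_page, "python"))   # higher: repetition now counts
```

Unlike the set-based matcher, this at least separates a passing mention from a page that is actually about the query topic.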

crawler/ Executable file

@@ -0,0 +1,49 @@
#! /usr/bin/env python3
# [(-l | --library) <library-path>] [<query> ...]
# Search the library archive (a .csv file produced by index-data) for the words in the query, and print the respective file paths.

import argparse
import csv
import sys

from fuzzywuzzy import fuzz

def makeCsvFieldLimitHuge():
    """The csv module has a fixed limit on field sizes. Fix that."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            break
        except OverflowError:
            #sys.maxsize overflows a C long on some platforms; halve until it fits.
            limit = int(limit/2)
def parseArgs():
    """Define the options, parse the command line, and return the options object."""
    optionsParser = argparse.ArgumentParser()
    #Without nargs, the parsed option value is a plain string, matching the string default.
    optionsParser.add_argument('-l', '--library', type = str, default = 'articles.csv', help = "specify the library to search")
    optionsParser.add_argument('query', type = str, nargs = '*', default = ['South Kensington, London'], help = "strings to search in the library")
    return optionsParser.parse_args()
def readArticles(path: str) -> list:
    """Read the library file."""
    with open(path, 'r') as csvfile:
        return [ f for f in csv.reader(csvfile) ]
def query(articles: list, search: str):
    """Search all the indexed documents for the given words, sort them by how well they match the search, and list all documents that score at least 30%."""
    ratio = [ (fuzz.token_set_ratio(f[1], search), f[0]) for f in articles ]
    for x in sorted(ratio, reverse = True):
        if x[0] >= 30:
            print("Match: %d%% in URL: %s" % x)
def main():
    makeCsvFieldLimitHuge()
    options = parseArgs()
    query(readArticles(options.library), " ".join(options.query))

if __name__ == "__main__":
    main()

crawler/fetch-data Executable file

@@ -0,0 +1,39 @@
#! /usr/bin/env bash
# fetch-data
# Crawl a number of HPC related sites.
# The sites are downloaded to a directory called "data", which also contains the respective log files from the downloads.
# The sites to download are listed at the end of this script.
wgetFlags="-r -N -k --random-wait --no-parent --adjust-extension --reject=.pdf --follow-tags=a"
# crawlSite <url> <logfile>
# Download the site at <url> into the directory "data", writing the wget output to <logfile>.
function crawlSite() {
    local baseUrl="$1"
    local logFile="$2"
    echo "fetching data from $baseUrl..."
    wget $wgetFlags -o "$logFile" "$baseUrl"
    local result=$?
    if ((result)) ; then
        echo "wget exited with error code $result, see $logFile for details"
    fi
}

dataDir="data"
mkdir -p "$dataDir"
cd "$dataDir" || exit 1
#XXX: Add sites to crawl here:
crawlSite dkrz.log
crawlSite hlrs.log
crawlSite slurm.log
crawlSite llnl.log
crawlSite netcdf.log

crawler/index-data Executable file

@@ -0,0 +1,65 @@
#!/usr/bin/env python3
# index-data
# Walk the sites stored under the "data" directory, and build two archives containing the names and contents of their text-containing files (.txt, .html, .xml).
# This creates two archive files:
# * A CSV file with two columns.
#   The first column gives the path of each file, while the second column gives its respective contents.
# * A newline-delimited JSON file.
#   This provides the same data in a form that can hopefully be imported directly into an Elasticsearch database using the _bulk API endpoint.
#   The action lines contain only an empty "index" object, relying on the "_index" to be provided in the request path; the "_id" field is assumed to be assigned randomly.
#   The source lines contain an object of the form
#   {
#       "path" : <path of the text file>,
#       "content" : <text content with newlines replaced by spaces>
#   }
import csv
import html2text
import json
import os
import re
kBaseDir = "data"
kIncludePatterns = [ #regex patterns for the file paths to include in the archive (the text-containing formats named above)
    r'\.txt$',
    r'\.html?$',
    r'\.xml$',
]
kCookedIncludeRegex = re.compile("(?:" + ")|(?:".join(kIncludePatterns) + ")")
#The base directory is expected to contain both downloaded sites contained in directories and download log files.
#We want to walk all the directories containing the data, and ignore the log files.
#Get the list of directories.
directories = next(os.walk(kBaseDir))[1]
#Walk the directory hierarchy and build a list of files that match one of the include patterns.
files = []
for dirName in directories:
    print("scanning " + kBaseDir + "/" + dirName + "...")
    for r, d, f in os.walk(kBaseDir + "/" + dirName):
        for file in f:
            path = os.path.join(r, file)
            if
                files.append(path)
#Open the files one by one, convert them into plain text, and concatenate their contents into a CSV file.
with open("articles.csv", "w") as of:
    o = csv.writer(of)
    with open("elastic-import.ndjson", "w") as jsonFile:
        actionJson = '{ "index" : {} }\n'
        for f in files:
            with open(f) as file:
                data = html2text.html2text(
            data = data.replace("\n", " ")  #flatten newlines as promised in the header comment
            o.writerow([f, data])
            jsonObject = { 'path': f, 'content': data }
            jsonString = json.dumps(jsonObject, separators = (',', ':')) + "\n"
            jsonFile.write(actionJson)
            jsonFile.write(jsonString)