Webcrawler and search for some HPC material

Using the webcrawler

There are three scripts in this directory, which are supposed to be run in sequence:

$ ./fetch-data
$ ./index-data
$ ./analysis.py some interesting query

###Step 1: Fetch the data

The first script is a web crawler that downloads "interesting" stuff from some HPC related websites.

###Step 2: Process the data

The second script walks the file hierarchy created by the first script and turns the HTML and XML markup into pure text. From this data, it builds two archive files: one is a two-column CSV file (path and content), the other is meant to be used as an input file for an Elasticsearch database. See the script comment in index-data for details.
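
As a rough illustration of the markup-to-text step, here is a minimal sketch using only Python's standard library. It is not the code in index-data: the names `TextExtractor`, `html_to_text` and `write_csv` are made up for this example, and the sketch drops all attribute values, whereas the real archive evidently keeps link tool-tips.

```python
import csv
import io
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(markup):
    """Return the text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def write_csv(pages, out):
    """Write two-column rows (path, content); `pages` maps a path to its markup."""
    writer = csv.writer(out)
    for path, markup in sorted(pages.items()):
        writer.writerow([path, html_to_text(markup)])


# Example with a hypothetical in-memory page instead of the crawled files
buffer = io.StringIO()
write_csv({"data/example.html": "<p>Hello <b>world</b></p>"}, buffer)
print(buffer.getvalue(), end="")
```

The `csv` module takes care of quoting, so page text containing commas or quotes still ends up in a single content column.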

###Step 3: Query the data

The third script uses Python's fuzzywuzzy module to match a given query to the contents of the archive file produced by the second script.

Performance of the search

I am not impressed by it. Sorry, but I really cannot sell this as a success story.

The problems that I see:

- The match percentage depends only on the number of query words that are found in the document. As such, the result list does not distinguish between a document that uses the query words once and one that uses them over and over again.
- The match percentage does not take into account whether the query words appear close together or not.
- Where in the text the query words appear is ignored completely. A page that lists, for example, "Python" in its title is scored the same as a page that contains a link with the tool-tip "a small python script to frobnicate foos". Both will be listed as 100 percent matches to the query "python".
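
The behaviour described above can be reproduced with a few lines of plain Python. This is a simplified stand-in, not fuzzywuzzy's actual algorithm: the function `match_percentage` and the two sample texts are invented for illustration, and the score is just the share of distinct query words found anywhere in the document.

```python
import re


def words(text):
    """Lower-cased alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def match_percentage(query, document):
    """Percentage of distinct query words that occur anywhere in the document."""
    query_words = set(words(query))
    if not query_words:
        return 0
    found = query_words & set(words(document))
    return round(100 * len(found) / len(query_words))


# Hypothetical page texts: one is about Python, one mentions it in a tool-tip
title_page = "Python tutorial. Python basics, Python syntax, Python modules."
tooltip_page = "Cluster news. a small python script to frobnicate foos"

print(match_percentage("python", title_page))    # 100
print(match_percentage("python", tooltip_page))  # 100
```

Because only the presence of each word counts, frequency, position and proximity have no way to influence the ranking.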

These problems severely limit the usefulness of the query feature, as shown in the examples below.

###Some example queries

####Searching for a Python introduction

$ ./analysis.py python introduction | head
Match: 100% in URL: data/www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html
Match: 100% in URL: data/www.unidata.ucar.edu/software/netcdf/docs/faq.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/projects-and-cooperations/ipcc-data/order-ipcc-data-on-dvd/ipcc-ddc-data-format-information.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/projects-and-cooperations/cops/example-files/switchLanguage?set_language=en.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/projects-and-cooperations/cops/example-files.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/esgf-services-1/esgf-preparation/switchLanguage?set_language=en.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/esgf-services-1/esgf-preparation.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/cmip-data-pool/switchLanguage?set_language=en.html
Match: 100% in URL: data/www.dkrz.de/up/services/data-management/cmip-data-pool.html
Match: 100% in URL: data/www.dkrz.de/up/services/analysis/visualization/sw/vapor/vapor/switchLanguage?set_language=en.html

The first document matches because it contains the sentence "C-based 3rd-party netCDF APIs for other languages include Python, Ruby, Perl, Fortran-2003, MATLAB, IDL, and R". The document is an introduction, all right, but for NetCDF, not Python. Likewise, the FAQ (second document) mentions several times that NetCDF can be used with Python, and it contains two links that lead to "introduction" documents.

The third link is even more obscure. Funnily enough, it mentions the NetCDF Python library again (once), but the word "introduction" does not even appear in the rendered HTML document. It is necessary to load the page's source HTML code to find out that there is a link contained within that page that has the tool-tip "Short introduction to the OpenStack Swift Storage system", and a second link to https://www.dkrz.de/up/my-dkrz/getting-started which is praised in the tool-tip as leading to "a short introduction".

Googling for "python introduction" yields much better results.

####Trying to solve a batch processing problem

$ ./analysis.py batch job not starting | head
Match: 100% in URL: data/www.dkrz.de/up/services/code-tuning/debugging/switchLanguage?set_language=en.html
Match: 100% in URL: data/www.dkrz.de/up/services/code-tuning/debugging.html
Match: 100% in URL: data/slurm.schedmd.com/srun.html
Match: 100% in URL: data/slurm.schedmd.com/squeue.html
Match: 100% in URL: data/slurm.schedmd.com/slurm.conf.html
Match: 100% in URL: data/slurm.schedmd.com/scontrol.html
Match: 100% in URL: data/slurm.schedmd.com/sbatch.html
Match: 100% in URL: data/slurm.schedmd.com/sacct.html
Match: 100% in URL: data/slurm.schedmd.com/reservations.html
Match: 100% in URL: data/slurm.schedmd.com/quickstart_admin.html

The first result describes debugging with ARM DDT, not how to troubleshoot problems with batch jobs. The second result is actually the same page as the first. The third and seventh results are somewhat useful: they are the online versions of man srun and man sbatch. The other results are not useful, as they are just the man pages of other slurm commands and have little to say about troubleshooting jobs that won't start. At least the slurm.schedmd.com links point the user to the correct software.

Google, with the same query, does not fare any better, as its results are dominated by Microsoft's Dynamics AX software.

####Trying to run a program on several nodes

$ ./analysis.py run program on several nodes in parallel | head
Match: 100% in URL: data/slurm.schedmd.com/srun.html
Match: 100% in URL: data/slurm.schedmd.com/quickstart.html
Match: 100% in URL: data/slurm.schedmd.com/programmer_guide.html
Match: 100% in URL: data/slurm.schedmd.com/faq.html
Match: 100% in URL: data/slurm.schedmd.com/download.html
Match: 100% in URL: data/slurm.schedmd.com/acct_gather_profile_plugins.html
Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/Open_MPI.html
Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/NEC_Cluster_cacau_introduction.html
Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/NEC_Cluster_access_(vulcan).html
Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/CRAY_XE6_notes_for_the_upgraded_Batch_System.html

Finally, a success: the second result provides the required information.

Google provides a wide variety of more or less helpful links, some of which are significantly better geared towards people without a solid education in HPC.

###Summary

The more specialized the queries were, the better the results of ./analysis.py became. However, the insensitivity of our algorithm to the location and number of the matches frequently allows entirely unrelated results to float to the top. Google does not suffer from this problem. Google only defeats itself whenever a major non-HPC technology or interpretation dominates its result list, pushing the useful results out of sight.

The most important improvement would be to weight matches by whether they occur within a link or its tool-tip. The next most important improvement would be to weight matches by where they occur within the text (title/introductory paragraphs/body/footnotes). The third would be to take into account whether the matches occur in close proximity or not. The fourth would be to consider the number of matches (passing the relative frequency through a log() function or similar).
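
To make these four ideas concrete, here is a minimal sketch of such a scoring function. Everything in it is an assumption for illustration: the split of a page into `title`, `body` and `link` fields, the weights in `FIELD_WEIGHTS`, and the way the proximity factor is combined with the rest. It is not part of analysis.py, and proximity is only measured within the body text.

```python
import math
import re

# Assumed weights: title matches count most, link text and tool-tips least
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0, "link": 0.2}


def words(text):
    """Lower-cased alphanumeric tokens of a text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_score(term, fields):
    """Field-weighted occurrence count of one term, damped with log(1 + x)."""
    weighted = sum(
        FIELD_WEIGHTS.get(name, 1.0) * words(text).count(term)
        for name, text in fields.items()
    )
    return math.log1p(weighted)


def min_span(query_words, tokens):
    """Length of the shortest token window containing every query word, or None."""
    needed = set(query_words)
    if not needed:
        return None
    counts = {}
    best = None
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink the window from the left while it still covers all words
        while len(counts) == len(needed):
            span = right - left + 1
            if best is None or span < best:
                best = span
            head = tokens[left]
            if head in counts:
                counts[head] -= 1
                if counts[head] == 0:
                    del counts[head]
            left += 1
    return best


def score(query, fields):
    """Combine field weighting, log-damped frequency and body proximity."""
    query_words = list(dict.fromkeys(words(query)))
    if not query_words:
        return 0.0
    base = sum(term_score(term, fields) for term in query_words)
    proximity = 1.0
    if len(query_words) > 1:
        span = min_span(query_words, words(fields.get("body", "")))
        if span:
            # Up to 2.0 for adjacent words, approaching 1.0 as they spread out
            proximity += len(query_words) / span
    return base * proximity


# Hypothetical pages: an actual introduction versus tool-tip-only matches
intro_page = {
    "title": "Python introduction",
    "body": "This python introduction covers python basics.",
    "link": "",
}
tooltip_page = {
    "title": "Data formats",
    "body": "NetCDF files are described here.",
    "link": "a small python script to frobnicate foos; a short introduction",
}

print(score("python introduction", intro_page) > score("python introduction", tooltip_page))
```

With these assumed weights the page that is actually about the topic ranks above the page that only mentions the query words in link tool-tips, whereas both would be 100 percent matches under the current scheme.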