Webcrawler and search for some HPC material
===========================================

Using the webcrawler
--------------------

There are three scripts in this directory, which are supposed to be run in sequence:

    $ ./fetch-data
    $ ./index-data
    $ ./analysis.py some interesting query

### Step 1: Fetch the data
The first script is a web crawler that downloads "interesting" content from a number of HPC-related websites.

### Step 2: Process the data
The second script walks the file hierarchy created by the first script and turns the HTML and XML markup into pure text.
From this data, it builds two archive files: one is a two-column CSV file (path and content),
the other is meant to be used as an input file for an Elasticsearch database.
See the script comment in `index-data` for details.

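The following is a minimal sketch of what such a processing step could look like, using only Python's standard library. It is not the code of `index-data`; the names `TextExtractor`, `markup_to_text` and `build_csv` are hypothetical, and the Elasticsearch input file is left out.

```python
import csv
import os
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of a document, skipping scripts and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def markup_to_text(markup):
    """Strip the tags from a markup string and normalize the whitespace."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def build_csv(root, csv_path):
    """Walk `root` and write one (path, content) row per HTML/XML file."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if not name.endswith((".html", ".xml")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="replace") as f:
                    writer.writerow([path, markup_to_text(f.read())])
```

Calling `build_csv("data", "archive.csv")` would then produce the two-column file; both file names are assumptions made for this example. Note that a text extraction like this one throws away the information whether a word came from a title, the body, or a link.
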
### Step 3: Query the data
The third script uses Python's fuzzywuzzy module to match a given query to the contents of the archive file produced by the second script.

Performance of the search
-------------------------

I am not impressed by it.
Sorry, but I really cannot sell this as a success story.

The problems that I see:

* The match percentage depends only on the number of query words that are found in the document.
  As such, the result list does not distinguish between a document that uses the query words once and one that uses them over and over again.

* The match percentage does not take into account whether the query words appear close together or not.

* The score completely ignores *where* in the text the query words appear.
  A page that has, for example, "Python" in its title scores the same as a page that contains a link with the tool-tip "a small python script to frobnicate foos".
  Both will be listed as 100 percent matches to the query "python".

These problems severely limit the usefulness of the query feature, as shown in the examples below.

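To make the first and third problem concrete, here is a small sketch of a score that behaves as described above. It is a hypothetical stand-in written for illustration, not the code of `analysis.py` and not fuzzywuzzy's actual algorithm; the two example pages are made up.

```python
import re


def words(text):
    """Lower-case a text and return the set of distinct words in it."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def presence_score(query, document):
    """Percentage of distinct query words that occur anywhere in the document."""
    query_words = words(query)
    if not query_words:
        return 0
    found = query_words & words(document)
    return round(100 * len(found) / len(query_words))


# A page that is about Python, and one that merely mentions it in a tool-tip.
title_page = "Python tutorial. Python basics. Learn Python step by step."
tooltip_page = "Cluster news and downtimes. a small python script to frobnicate foos"

print(presence_score("python", title_page))    # prints 100
print(presence_score("python", tooltip_page))  # prints 100
```

Because the score only asks whether each query word is present, neither the three occurrences on the first page nor the incidental position of the word on the second page make any difference.
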
### Some example queries

#### Searching for a Python introduction

    $ ./analysis.py python introduction | head
    Match: 100% in URL: data/www.unidata.ucar.edu/software/netcdf/docs/netcdf_introduction.html
    Match: 100% in URL: data/www.unidata.ucar.edu/software/netcdf/docs/faq.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/projects-and-cooperations/ipcc-data/order-ipcc-data-on-dvd/ipcc-ddc-data-format-information.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/projects-and-cooperations/cops/example-files/switchLanguage?set_language=en.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/projects-and-cooperations/cops/example-files.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/esgf-services-1/esgf-preparation/switchLanguage?set_language=en.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/esgf-services-1/esgf-preparation.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/cmip-data-pool/switchLanguage?set_language=en.html
    Match: 100% in URL: data/www.dkrz.de/up/services/data-management/cmip-data-pool.html
    Match: 100% in URL: data/www.dkrz.de/up/services/analysis/visualization/sw/vapor/vapor/switchLanguage?set_language=en.html

The first document matches because it contains the sentence "C-based 3rd-party netCDF APIs for other languages include Python, Ruby, Perl, Fortran-2003, MATLAB, IDL, and R".
The document is an introduction, all right, but for NetCDF, not Python.
Likewise, the FAQ (second document) mentions several times that NetCDF can be used with Python, and it contains two links that lead to "introduction" documents.

The third result is even more obscure.
Funnily enough, it mentions the NetCDF Python library again (once), but the word "introduction" does not even appear in the rendered HTML document.
Only the page's HTML source reveals that the page contains a link with the tool-tip
"Short introduction to the OpenStack Swift Storage system",
and a second link to https://www.dkrz.de/up/my-dkrz/getting-started which is praised in the tool-tip as leading to "a short introduction".

Googling for "python introduction" yields much better results.

#### Trying to solve a batch processing problem

    $ ./analysis.py batch job not starting | head
    Match: 100% in URL: data/www.dkrz.de/up/services/code-tuning/debugging/switchLanguage?set_language=en.html
    Match: 100% in URL: data/www.dkrz.de/up/services/code-tuning/debugging.html
    Match: 100% in URL: data/slurm.schedmd.com/srun.html
    Match: 100% in URL: data/slurm.schedmd.com/squeue.html
    Match: 100% in URL: data/slurm.schedmd.com/slurm.conf.html
    Match: 100% in URL: data/slurm.schedmd.com/scontrol.html
    Match: 100% in URL: data/slurm.schedmd.com/sbatch.html
    Match: 100% in URL: data/slurm.schedmd.com/sacct.html
    Match: 100% in URL: data/slurm.schedmd.com/reservations.html
    Match: 100% in URL: data/slurm.schedmd.com/quickstart_admin.html

The first result describes debugging with ARM DDT, not how to troubleshoot problems with batch jobs.
The second result is actually the same page as the first.
The third and seventh results are actually somewhat useful: they are the online versions of `man srun` and `man sbatch`.
The other results are not useful, as they are just the man pages of further Slurm commands
and have little information to give on troubleshooting jobs that won't start.
At least the slurm.schedmd.com links point the user to the correct software.

Google, with the same query, does not fare any better, as its results are dominated by Microsoft's Dynamics AX software.

#### Trying to run a program on several nodes

    $ ./analysis.py run program on several nodes in parallel | head
    Match: 100% in URL: data/slurm.schedmd.com/srun.html
    Match: 100% in URL: data/slurm.schedmd.com/quickstart.html
    Match: 100% in URL: data/slurm.schedmd.com/programmer_guide.html
    Match: 100% in URL: data/slurm.schedmd.com/faq.html
    Match: 100% in URL: data/slurm.schedmd.com/download.html
    Match: 100% in URL: data/slurm.schedmd.com/acct_gather_profile_plugins.html
    Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/Open_MPI.html
    Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/NEC_Cluster_cacau_introduction.html
    Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/NEC_Cluster_access_(vulcan).html
    Match: 100% in URL: data/kb.hlrs.de/platforms/index.php/CRAY_XE6_notes_for_the_upgraded_Batch_System.html

Finally a success: the second result provides the required information.

Google provides a wide variety of more or less helpful links, some of which are significantly better geared towards people without a solid education in HPC.

### Summary

The more specialized the queries were, the better the results of `./analysis.py` became.
However, the insensitivity of our algorithm to the location and the number of the matches frequently allows entirely unrelated results to float to the top.
Google does not suffer from this problem.
Google only defeats itself whenever there is a major non-HPC technology/interpretation/thing that dominates its result list, pushing the useful results out of sight.

The most important improvement would be to take into account whether a match occurs within a link or its tool-tip.
The next most important improvement would be to weight where the match occurs within the text (title/introductory paragraphs/body/footnotes).
The third improvement would be to take into account whether the matches occur in close proximity or not.
The fourth improvement would be to consider the number of matches (passing their relative frequency through a log() function or similar).
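
A rough sketch of how these four improvements could be combined into one score is given below. None of this exists in `analysis.py`: the field names, the weights in `FIELD_WEIGHTS` and the way the proximity bonus is multiplied in are assumptions chosen for illustration, and the raw match count is used in place of a relative frequency.

```python
import math
import re

# Hypothetical weights: a match in the title counts more than one in the
# body, and a match inside a link or its tool-tip counts least.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0, "link": 0.2}


def tokenize(text):
    """Lower-case a text and split it into a list of words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_score(query_words, fields):
    """Sum weight * log(1 + count) over all fields and query words."""
    total = 0.0
    for field, text in fields.items():
        tokens = tokenize(text)
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for word in query_words:
            total += weight * math.log1p(tokens.count(word))
    return total


def proximity_bonus(query_words, tokens):
    """Distinct query words divided by the smallest window holding them all."""
    needed = set(query_words)
    if not needed or not needed.issubset(tokens):
        return 0.0
    best = len(tokens)
    counts = {}
    left = 0
    for right, token in enumerate(tokens):
        if token in needed:
            counts[token] = counts.get(token, 0) + 1
        # Shrink the window from the left while it still holds every word.
        while len(counts) == len(needed):
            best = min(best, right - left + 1)
            dropped = tokens[left]
            if dropped in counts:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    del counts[dropped]
            left += 1
    return len(needed) / best


def score(query, fields):
    """Weighted term score, boosted when the query words sit close together."""
    query_words = tokenize(query)
    body_tokens = tokenize(fields.get("body", ""))
    bonus = proximity_bonus(query_words, body_tokens)
    return term_score(query_words, fields) * (1.0 + bonus)


# Two made-up documents, split into the fields the weights refer to.
intro_page = {
    "title": "Python introduction",
    "body": "This introduction teaches Python from scratch.",
    "link": "",
}
faq_page = {
    "title": "NetCDF FAQ",
    "body": "NetCDF is a data format.",
    "link": "python bindings, short introduction to Swift",
}
print(score("python introduction", intro_page) > score("python introduction", faq_page))  # prints True
```

A score like this requires the processing step to keep title, body and link text apart instead of merging everything into one string.
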