# retrieving the index data from Common Crawl

Install [cdx-index-client](https://github.com/ikreymer/cdx-index-client), preferably in a Python virtual environment.
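
A minimal setup sketch (assuming `git` and `python3` are available, and that the repository's `requirements.txt` covers its dependencies):

```
python3 -m venv venv
source venv/bin/activate
git clone https://github.com/ikreymer/cdx-index-client.git
cd cdx-index-client
pip install -r requirements.txt
```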

Obtain the index files (e.g. all `.be` URIs from the June 2018 crawl, saving them in a directory `201806`):

```
./cdx-index-client.py -c CC-MAIN-2018-26 -z -d 201806 '*.be'
```

The available index collections (such as `CC-MAIN-2018-26` for the June 2018 crawl) are listed on <http://index.commoncrawl.org/>.
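
The list of collections can also be fetched from the index server's `collinfo.json` endpoint (the `jq` filter is just one way to extract the ids):

```
curl -s http://index.commoncrawl.org/collinfo.json | jq -r '.[].id'
```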

This results in a lot of gzipped files in the directory `201806`.
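
Before recompressing, it may be worth checking the downloads for truncated files (this assumes everything in `201806` is gzip data; `gzip -t` is silent on success and reports any corrupt file):

```
gzip -t 201806/*
```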

The next step is recompressing them into a single `xz` file per crawl:

```
zcat 201806/* | pixz > 201806.xz
```
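
As a quick sanity check, the archive can be decompressed back to stdout (`pixz -d` reads from stdin; a plain `xzcat` works too, just single-threaded):

```
pixz -d < 201806.xz | head -n 3
```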

# preparing data

The data to start with are `CDXJ` files obtained from Common Crawl, compressed using `xz`.
In this case the file names are formatted as `<YYYYmm>.xz`, e.g. `201806.xz` for June 2018.
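
Each line of a `CDXJ` file is a SURT-formatted URL key and a timestamp, followed by a JSON object. An illustrative, abbreviated record (not taken from a real crawl):

```
be,example)/some/path 20180620123456 {"url": "http://www.example.be/some/path", "mime": "text/html", "status": "200"}
```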

The records need to be sorted before any processing can take place:

`01_sort_all.sh`:

```
xzcat 2018* | tr '[:upper:]' '[:lower:]' | sort -T tmp -S 10g | pixz > all_sorted.xz
```

(The `-T` option tells `sort` to use a temporary directory `tmp` inside the current directory instead of the system default; this is only because `/tmp` was on a small partition on the processing server.)

This command converts all entries to lower case (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.
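
An optional sanity check: `sort -c` verifies the order without re-sorting, exiting non-zero at the first out-of-order line (run it under the same locale as the sort itself):

```
xzcat all_sorted.xz | sort -c && echo "order OK"
```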