The records need to be sorted before any processing can take place:

```
xzcat 2018* | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -T tmp -S 10g | pixz > all_sorted.xz
```

This command converts all entries to lower case (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.

The -T option tells `sort` to use a temporary dir `tmp` in the current dir instead of the system default. This is just because on the processing server `/tmp` was on a small partition. `LC_ALL=C` guarantees a byte-order sort. This is important because sorting according to a locale can lead to duplicates after cutting.
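
To see the failure mode, here is a small illustration (hedged: the exact interleaving depends on the system's collation tables; glibc's `en_US.UTF-8` is assumed here):

```
# Three entries spanning two SURT domains:
printf 'com,example)/a\ncom,example)/z\ncom,example-a)/x\n' > demo.txt

# A locale-aware sort may weigh ')' and '-' only at a late level,
# so the two domains end up interleaved:
LC_ALL=en_US.UTF-8 sort demo.txt | cut -d ')' -f 1 | uniq
# com,example
# com,example-a
# com,example      <- the same domain shows up twice after cutting

# A byte-order sort keeps each domain's lines adjacent:
LC_ALL=C sort demo.txt | cut -d ')' -f 1 | uniq
# com,example
# com,example-a
```
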
To do some analysis on 'real' data, let's keep only the records that originate from a 200 OK response (so no redirects, errors, ...), drop all robots.txt entries, and keep only the latest entry in case of duplicates:

```
xzcat all_sorted.xz | ./cc-index-tools dedup | pixz > all_sorted_deduplicated.xz
```
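
Roughly, the filter logic can be sketched as below (a hypothetical `awk` reimplementation, not the actual `cc-index-tools` code; it assumes the usual CDX layout `SURT timestamp {json}` with a `"status"` field):

```
# Hedged sketch only -- the real logic lives in `cc-index-tools dedup`.
# Input is sorted, so within a run of equal SURT keys the last line
# carries the latest timestamp.
xzcat all_sorted.xz | awk '
    /"status": "200"/ && $1 !~ /robots\.txt$/ {
        # new SURT key: emit the remembered entry of the previous key
        if ($1 != key && line != "") print line
        key = $1; line = $0       # later duplicates simply overwrite
    }
    END { if (line != "") print line }
' | pixz > all_sorted_deduplicated.xz
```
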
To speed up counts on domains, it might be worth extracting the full domains first:

```
xzcat all_sorted.xz | cut -d ')' -f 1 | uniq | pixz > all_domains_sorted.xz
xzcat all_sorted_deduplicated.xz | cut -d ')' -f 1 | uniq | pixz > all_domains_sorted_deduplicated.xz
```
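
With those domain lists in place, counts become cheap. For example (illustrative; both inputs are already byte-order sorted, so `comm` can consume them directly):

```
# Number of distinct domains, with and without the filtering:
xzcat all_domains_sorted.xz | wc -l
xzcat all_domains_sorted_deduplicated.xz | wc -l

# Domains that disappear entirely after filtering:
LC_ALL=C comm -23 <(xzcat all_domains_sorted.xz) \
                  <(xzcat all_domains_sorted_deduplicated.xz)
```
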
Note that the domain extraction works on SURTs, which means that `www.example.com` and `example.com` are both canonicalized to `com,example` and thus considered equal.
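
A minimal illustration with made-up index lines:

```
# Both hypothetical entries share the SURT domain "com,example":
printf 'com,example)/ 20180102000000 {...}\ncom,example)/test 20180101000000 {...}\n' \
    | cut -d ')' -f 1 | uniq
# com,example
```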