The next step is compressing them per month in `xz` files:

```
zcat 201806/* | pixz > 201806.xz
```
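If there are several months to process, the same command can be wrapped in a small loop; a sketch, assuming one directory per month named `<YYYYmm>`:

```
for d in 20*/ ; do
    # strip the trailing slash from the directory name, e.g. 201806/ -> 201806.xz
    zcat "$d"* | pixz > "${d%/}.xz"
done
```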

# Preparing the data

The data to start with are `CDXJ` files obtained from Common Crawl, compressed using `xz`.

In this case the file names are formatted `<YYYYmm>.xz`, e.g. `201806.xz` for June 2018.
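Each line in these index files looks roughly like the following (an illustrative, simplified example; the real JSON block contains more fields):

```
com,example)/some/path 20180623120000 {"url": "http://www.example.com/some/path", "mime": "text/html", "status": "200"}
```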
This command converts all entries to lower case (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.
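For reference, a rough sketch of what such a lower-casing and sorting step could look like (the input glob, buffer size and scratch directory here are assumptions, not necessarily the exact options used):

```
xzcat 20*.xz | tr '[:upper:]' '[:lower:]' | sort -S 8g -T /path/to/scratch | pixz > all_sorted.xz
```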
To do some analysis on 'real' data, let's keep only records that originate from a 200 OK response (so no redirects, errors, ...), drop `robots.txt` entries, and keep only the latest entry in case of duplicates.

```
xzcat all_sorted.xz | ./cc-index-tools dedup | pixz > all_sorted_deduplicated.xz
```
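The `cc-index-tools dedup` step above does the real work; purely as an illustration of the kind of filtering and deduplication described (and not a substitute for that tool), a plain `awk` pass over the sorted CDXJ lines could look like this, assuming the `<SURT key> <timestamp> <JSON>` layout shown earlier:

```
xzcat all_sorted.xz | awk '
    # drop robots.txt entries and anything that is not a 200 response
    $1 ~ /\/robots\.txt$/    { next }
    $0 !~ /"status": ?"200"/ { next }
    # the input is sorted, so identical SURT keys are adjacent and the last
    # line of a run is the latest capture: print it whenever the key changes
    $1 != prevkey { if (prevline != "") print prevline }
                  { prevkey = $1; prevline = $0 }
    END           { if (prevline != "") print prevline }
' | pixz > all_sorted_deduplicated.xz
```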
# Some simple statistics
## 1. Domains

To speed up counts on domains, it might be worth extracting the full domains first:
```
xzcat all_sorted.xz | cut -d ')' -f 1 | uniq | sort -S 8g -u | pixz > all_domains_sorted.xz
xzcat all_sorted_deduplicated.xz | cut -d ')' -f 1 | uniq | sort -S 8g -u | pixz > all_domains_sorted_deduplicated.xz
```
Note that this works on SURTs, which means that `www.example.com` and `example.com` are both converted to `com,example` and thus considered equal.
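As a quick illustration (the line below is made up), the `cut` on `)` reduces a full SURT key to just its domain part:

```
$ echo 'com,example)/some/path 20180623120000 {"status": "200"}' | cut -d ')' -f 1
com,example
```

Counting the unique domains afterwards is then simply a matter of e.g. `xzcat all_domains_sorted.xz | wc -l`.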