The records need to be sorted before any processing can take place:

```
xzcat 2018* | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -T tmp -S 10g | pixz > all_sorted.xz
```

This command converts all entries to lower case (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.

The -T option tells `sort` to use a temporary dir `tmp` in the current dir instead of the system default. This is just because on the processing server `/tmp` was on a small partition. `LC_ALL=C` guarantees a byte-order sort. This is important because sorting according to a locale can lead to duplicates after cutting.
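
To see the failure mode, here is a small illustration (hedged: the exact interleaving depends on the system's collation tables; glibc's `en_US.UTF-8` is assumed here):

```
# Three entries spanning two SURT domains:
printf 'com,example)/a\ncom,example)/z\ncom,example-a)/x\n' > demo.txt

# A locale-aware sort may weigh ')' and '-' only at a late level,
# so the two domains end up interleaved:
LC_ALL=en_US.UTF-8 sort demo.txt | cut -d ')' -f 1 | uniq
# com,example
# com,example-a
# com,example      <- the same domain shows up twice after cutting

# A byte-order sort keeps each domain's lines adjacent:
LC_ALL=C sort demo.txt | cut -d ')' -f 1 | uniq
# com,example
# com,example-a
```
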
To do some analysis on 'real' data, let's keep only the records that originate from a 200 OK response (so no redirects, errors, ...), drop all robots.txt entries, and keep only the latest entry in case of duplicates:

```
xzcat all_sorted.xz | ./cc-index-tools dedup | pixz > all_sorted_deduplicated.xz
```
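
Roughly, the filter logic can be sketched as below (a hypothetical `awk` reimplementation, not the actual `cc-index-tools` code; it assumes the usual CDX layout `SURT timestamp {json}` with a `"status"` field):

```
# Hedged sketch only -- the real logic lives in `cc-index-tools dedup`.
# Input is sorted, so within a run of equal SURT keys the last line
# carries the latest timestamp.
xzcat all_sorted.xz | awk '
    /"status": "200"/ && $1 !~ /robots\.txt$/ {
        # new SURT key: emit the remembered entry of the previous key
        if ($1 != key && line != "") print line
        key = $1; line = $0       # later duplicates simply overwrite
    }
    END { if (line != "") print line }
' | pixz > all_sorted_deduplicated.xz
```
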
To speed up counts on domains, it might be worth extracting the full domains first:

```
xzcat all_sorted.xz | cut -d ')' -f 1 | uniq | pixz > all_domains_sorted.xz
xzcat all_sorted_deduplicated.xz | cut -d ')' -f 1 | uniq | pixz > all_domains_sorted_deduplicated.xz
```
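
With those domain lists in place, counts become cheap. For example (illustrative; both inputs are already byte-order sorted, so `comm` can consume them directly):

```
# Number of distinct domains, with and without the filtering:
xzcat all_domains_sorted.xz | wc -l
xzcat all_domains_sorted_deduplicated.xz | wc -l

# Domains that disappear entirely after filtering:
LC_ALL=C comm -23 <(xzcat all_domains_sorted.xz) \
                  <(xzcat all_domains_sorted_deduplicated.xz)
```
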
Note that the domain extraction works on SURTs, which means that `www.example.com` and `example.com` are both canonicalized to `com,example` and thus considered equal.
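
A minimal illustration with made-up index lines:

```
# Both hypothetical entries share the SURT domain "com,example":
printf 'com,example)/ 20180102000000 {...}\ncom,example)/test 20180101000000 {...}\n' \
    | cut -d ')' -f 1 | uniq
# com,example
```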