... | ... | @@ -30,7 +30,7 @@ The records need to be sorted before any processing can take place: |
|
|
`01_sort_all.sh`:
|
|
|
|
|
|
```
|
|
|
xzcat 2018* | tr '[:upper:]' '[:lower:]' | sort -T tmp -S 10g | pixz > all_sorted.xz
|
|
|
xzcat 2018* | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -T tmp -S 10g | pixz > all_sorted.xz
|
|
|
```
|
|
|
(The `-T` option tells `sort` to use the temporary directory `tmp` in the current directory instead of
|
|
|
the system default. This is only because `/tmp` was on a small partition on the processing server.)
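The `LC_ALL=C` prefix on the `sort` call makes `sort` compare lines byte by byte instead of using the locale's collation rules, so the order no longer depends on the server's locale settings. A minimal sketch of the effect, using made-up sample hostnames (the variable name `sorted_c` is only for illustration):

```shell
# Sort three sample lines bytewise. Under LC_ALL=C no locale collation
# rules apply, and digits (0x30-0x39) order before lowercase letters.
sorted_c=$(printf 'b.example\na.example\n1.example\n' | LC_ALL=C sort)
printf '%s\n' "$sorted_c"
```

Later steps that rely on sorted input (for example `uniq` or `join`) should probably run under the same locale, so that they agree with `sort` on the ordering.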
|
... | ... | @@ -64,6 +64,11 @@ xzcat all_domains_sorted.xz | wc -l |
|
|
```
|
|
|
**=> 647631**
|
|
|
|
|
|
```
|
|
|
xzcat all_domains_sorted_deduplicated.xz | wc -l
|
|
|
```
|
|
|
**=> **
|
|
|
|
|
|
### 1.2 Number of TLDs
|
|
|
The difference from 1.1 is that this counts only the top-level domains, e.g. `example.com` and not `test.example.com`.
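One rough way to do this reduction is to keep only the last two labels of each name before deduplicating. The sketch below is an assumption, not necessarily the method used here: `last_two_labels` is a hypothetical helper, the input is assumed to hold one hostname per line, and multi-label suffixes such as `co.uk` are not handled.

```shell
# Hypothetical helper: reduce each hostname to its last two labels,
# e.g. test.example.com -> example.com. Lines with fewer than two
# labels are skipped. Does NOT handle suffixes like co.uk.
last_two_labels() {
  awk -F. 'NF >= 2 { print $(NF-1) "." $NF }'
}

# Count the distinct reduced names in a small made-up sample.
count=$(printf 'example.com\ntest.example.com\nexample.org\n' \
  | last_two_labels | LC_ALL=C sort -u | wc -l)
echo "$count"
```

For the three sample lines this leaves two distinct names, `example.com` and `example.org`. On real data the sample `printf` would be replaced by the decompressed domain list.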
|
|
|
|
... | ... | |