@@ -35,3 +35,18 @@ the system default. This is just because on the processing server `/tmp` was on
This command converts all entries to lowercase (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.
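
As a small illustration of what this step does to the data, the following sketch applies the same two operations (lowercasing, then sorting without deduplication) to three made-up entries. The sample URLs and the variable name `sorted` are assumptions for illustration only, and the actual command may use different tools or options.

```shell
# Three made-up entries; two of them differ only in case.
sorted=$(printf '%s\n' \
    'http://www.example.com/TEST' \
    'http://www.example.com/test' \
    'http://www.example.com/About' \
  | tr '[:upper:]' '[:lower:]' \
  | LC_ALL=C sort)

# Prints three lines: the duplicate ".../test" entry is kept twice.
printf '%s\n' "$sorted"
```

Because duplicates survive this step, later counts over `all_sorted.xz` still reflect how often each entry occurred.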
# Some simple statistics
## 1. Domains
To speed up counts on domains, it might be worth extracting the full domains first:
```
xzcat all_sorted.xz | cut -d ')' -f 1 | uniq | sort -S 8g -u | pixz > all_domains_sorted.xz
```
Note that this works on SURTs (Sort-friendly URI Reordering Transform), which means that `www.example.com` and `example.com` are both converted to `com.example` and are thus considered equal.
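
To make the extraction step concrete, here is a minimal sketch that runs the same `cut | uniq | sort -u` chain on a few hypothetical entries. The sample lines, their comma-separated host layout, and the variable names `entries` and `domains` are assumptions for illustration. The only property the pipeline relies on is that the host part ends at the first `)`. Compression and the `-S 8g` buffer option are left out to keep the example small.

```shell
# Hypothetical SURT-style entries, already sorted; the host part ends at ')'.
entries='com,example)/
com,example)/about
com,example,test)/index.html
org,example)/'

# Keep everything before the first ')', collapse adjacent repeats with uniq,
# then let sort -u remove any duplicates that were not adjacent.
domains=$(printf '%s\n' "$entries" | cut -d ')' -f 1 | uniq | LC_ALL=C sort -u)

# Prints one line per distinct host: com,example / com,example,test / org,example
printf '%s\n' "$domains"
```

Since the input is already sorted, most duplicate hosts are adjacent, so running `uniq` before `sort -u` shrinks the data the sort has to handle.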
### 1.1 Number of full domains
Counts the number of full domains. Full domains are TLDs as well as subdomains, e.g. `example.com` and `test.example.com` are counted as different domains.