The next step is compressing them per month in `xz` files:

```
zcat 201806/* | pixz > 201806.xz
```
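If there are several months to process, the same command can be wrapped in a small loop; a sketch, assuming one directory per month named `<YYYYmm>`:

```
for d in 20*/ ; do
    # strip the trailing slash from the directory name, e.g. 201806/ -> 201806.xz
    zcat "$d"* | pixz > "${d%/}.xz"
done
```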

# Preparing the data

The data to start with are `CDXJ` files obtained from Common Crawl, compressed using `xz`.

In this case the file names are formatted `<YYYYmm>.xz`, e.g. `201806.xz` for June 2018.
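Each line in these index files looks roughly like the following (an illustrative, simplified example; the real JSON block contains more fields):

```
com,example)/some/path 20180623120000 {"url": "http://www.example.com/some/path", "mime": "text/html", "status": "200"}
```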
This command converts all entries to lower case (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.
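For reference, a rough sketch of what such a lower-casing and sorting step could look like (the input glob, buffer size and scratch directory here are assumptions, not necessarily the exact options used):

```
xzcat 20*.xz | tr '[:upper:]' '[:lower:]' | sort -S 8g -T /path/to/scratch | pixz > all_sorted.xz
```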
To do some analysis on 'real' data, let's keep only records that originate from a 200 OK response (so no redirects, errors, ...), drop `robots.txt` entries, and keep only the latest entry in case of duplicates.

```
xzcat all_sorted.xz | ./cc-index-tools dedup | pixz > all_sorted_deduplicated.xz
```
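The `cc-index-tools dedup` step above does the real work; purely as an illustration of the kind of filtering and deduplication described (and not a substitute for that tool), a plain `awk` pass over the sorted CDXJ lines could look like this, assuming the `<SURT key> <timestamp> <JSON>` layout shown earlier:

```
xzcat all_sorted.xz | awk '
    # drop robots.txt entries and anything that is not a 200 response
    $1 ~ /\/robots\.txt$/    { next }
    $0 !~ /"status": ?"200"/ { next }
    # the input is sorted, so identical SURT keys are adjacent and the last
    # line of a run is the latest capture: print it whenever the key changes
    $1 != prevkey { if (prevline != "") print prevline }
                  { prevkey = $1; prevline = $0 }
    END           { if (prevline != "") print prevline }
' | pixz > all_sorted_deduplicated.xz
```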
# Some simple statistics
## 1. Domains

To speed up counts on domains, it might be worth extracting the full domains first:
```
xzcat all_sorted.xz | cut -d ')' -f 1 | uniq | sort -S 8g -u | pixz > all_domains_sorted.xz
xzcat all_sorted_deduplicated.xz | cut -d ')' -f 1 | uniq | sort -S 8g -u | pixz > all_domains_sorted_deduplicated.xz
```
Note that this works on SURTs, which means that `www.example.com` and `example.com` are both converted to `com,example` and thus considered equal.
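As a quick illustration (the line below is made up), the `cut` on `)` reduces a full SURT key to just its domain part:

```
$ echo 'com,example)/some/path 20180623120000 {"status": "200"}' | cut -d ')' -f 1
com,example
```

Counting the unique domains afterwards is then simply a matter of e.g. `xzcat all_domains_sorted.xz | wc -l`.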