These are the steps taken to compute some statistics on [Common Crawl](http://commoncrawl.org/) data. Results are included for data from January 2018 up until July 2018.
# Retrieving the index data from Common Crawl
Install [cdx-index-client](https://github.com/ikreymer/cdx-index-client), preferably in a Python virtual environment.
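
A typical invocation to then fetch part of an index could look like the following; the collection name, URL pattern, and flags shown here are assumptions based on the client's README, not taken from this document:

```
# Download the URL index for one monthly crawl (hypothetical collection and
# pattern); -z keeps the output gzipped, -d selects the output directory.
cdx-index-client.py -c CC-MAIN-2018-05 '*.example.com' --fl url -z -d ./index/
```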
### 1.1 Number of full domains
Counts the number of full domains. Full domains are TLDs as well as subdomains; e.g. `example.com` and `test.example.com` are counted as distinct domains.
```
xzcat all_domains_sorted.xz | wc -l
```
**=> 647631**
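
As a sanity check on what is being counted, the same command can be run on a few hypothetical SURT-style lines (the sample data below is illustrative, not taken from the actual index):

```
# Four hypothetical SURT-format domain entries: the bare domain
# example.com, two of its subdomains, and one unrelated domain.
printf 'com,example\ncom,example,test\ncom,example,www\norg,wikipedia\n' | wc -l
```

**=> 4**, since every line counts and subdomains are tallied separately from their parent domain.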
### 1.2 Number of TLDs
The difference from 1.1 is that this counts only top-level domains, e.g. `example.com` but not `test.example.com`.
```
xzcat all_domains_sorted.xz | cut -d ',' -f 1,2 | uniq | wc -l
```
**=> 507704**
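
On a few hypothetical SURT-style lines (illustrative sample data, not from the actual index), cutting the first two comma-separated fields collapses every subdomain onto its parent domain before deduplication:

```
# cut -f 1,2 reduces com,example,test and com,example,www to com,example;
# uniq then drops the duplicates (the input is already sorted).
printf 'com,example\ncom,example,test\ncom,example,www\norg,wikipedia\n' | cut -d ',' -f 1,2 | uniq | wc -l
```

**=> 2** (`com,example` and `org,wikipedia`).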