# retrieving the index data from Common Crawl

Install [cdx-index-client](https://github.com/ikreymer/cdx-index-client), preferably in a Python virtual environment.
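
A minimal setup sketch (assuming `git` and `python3` are available, and that the repository's `requirements.txt` covers its dependencies):

```
python3 -m venv venv
source venv/bin/activate
git clone https://github.com/ikreymer/cdx-index-client.git
cd cdx-index-client
pip install -r requirements.txt
```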

Obtain the index files (e.g. all `.be` URIs from the June 2018 crawl, saving them in a directory `201806`):

```
./cdx-index-client.py -c CC-MAIN-2018-26 -z -d 201806 '*.be'
```

The available index collections (such as `CC-MAIN-2018-26` for the June 2018 crawl) are listed on <http://index.commoncrawl.org/>.
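
The list of collections can also be fetched from the index server's `collinfo.json` endpoint (the `jq` filter is just one way to extract the ids):

```
curl -s http://index.commoncrawl.org/collinfo.json | jq -r '.[].id'
```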

This results in a lot of gzipped files in the directory `201806`.
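
Before recompressing, it may be worth checking the downloads for truncated files (this assumes everything in `201806` is gzip data; `gzip -t` is silent on success and reports any corrupt file):

```
gzip -t 201806/*
```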

The next step is recompressing them into a single `xz` file per crawl:

```
zcat 201806/* | pixz > 201806.xz
```
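
As a quick sanity check, the archive can be decompressed back to stdout (`pixz -d` reads from stdin; a plain `xzcat` works too, just single-threaded):

```
pixz -d < 201806.xz | head -n 3
```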

# preparing data

The data to start with are `CDXJ` files obtained from Common Crawl, compressed using `xz`.
In this case the file names are formatted as `<YYYYmm>.xz`, e.g. `201806.xz` for June 2018.
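
Each line of a `CDXJ` file is a SURT-formatted URL key and a timestamp, followed by a JSON object. An illustrative, abbreviated record (not taken from a real crawl):

```
be,example)/some/path 20180620123456 {"url": "http://www.example.be/some/path", "mime": "text/html", "status": "200"}
```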

The records need to be sorted before any processing can take place:

`01_sort_all.sh`:

```
xzcat 2018* | tr '[:upper:]' '[:lower:]' | sort -T tmp -S 10g | pixz > all_sorted.xz
```

(The `-T` option tells `sort` to use a temporary directory `tmp` inside the current directory instead of the system default; this is only because `/tmp` was on a small partition on the processing server.)

This command converts all entries to lower case (e.g. `http://www.example.com/TEST` -> `http://www.example.com/test`), then sorts all entries (without removing duplicates!) and stores the result in `all_sorted.xz`.
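
An optional sanity check: `sort -c` verifies the order without re-sorting, exiting non-zero at the first out-of-order line (run it under the same locale as the sort itself):

```
xzcat all_sorted.xz | sort -c && echo "order OK"
```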