Steps:
- [x] pseudo crawl ~10% of C4 web pages from Common Crawl @tianjianjiang
- [x] import pseudo crawled dataset on JZ @SaulLu
- [x] run 1st step of extraction:
    1. Extract text, HTML head sections, HTML footer sections, HTML Titles section and HTML metadata @SaulLu
    2. Change format of URL @SaulLu
    3. Extract Timestamp @cccntu @SaulLu
    4. Extract Generation Length Sentence @chkla @SaulLu
    5. Extract Generation Length Text @chkla @SaulLu
    6. Extract Data source @chkla @SaulLu
- [x] run 2nd step of extraction:
    1. Extract Website descriptions @shanyas10 @SaulLu
- [x] run 3rd step of extraction:
    1. Extract Entities @manandey @SaulLu
    2. (optional) Extract Entities descriptions @manandey @SaulLu
- [x] run 4th step of extraction:
    1. Extract Paragraph @tianjianjiang @SaulLu
        * #114
        * #125
        * annotator (preprocessor) of the metadata
    2. Modify entities metadata with paragraph information @manandey @SaulLu
    3. Modify generation length with paragraph information @chkla @SaulLu
- [ ] (optional) clean final dataset:
    1. Remove empty lines @SaulLu
    2. Remove "errors" columns @SaulLu
    3. (optional) Gather all metadata into same column @cccntu @timoschick @SaulLu
- [ ] push dataset to Hub @SaulLu
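
For the optional "clean final dataset" step, a rough sketch in plain Python might help pin down what the three sub-steps mean. Everything here is an assumption made for illustration, not the project's actual code or schema: the helper name `clean_rows`, the `text` key, the `_error` suffix for the "errors" columns, and the `metadata_` prefix are all hypothetical, and "empty lines" is read as rows whose text is empty or whitespace only.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def clean_rows(
    rows: List[Row],
    text_key: str = "text",              # hypothetical name of the text column
    error_suffix: str = "_error",        # hypothetical marker of "errors" columns
    metadata_prefix: str = "metadata_",  # hypothetical prefix of metadata columns
) -> List[Row]:
    """Sketch of the cleanup: drop empty rows, drop error columns, gather metadata."""
    cleaned: List[Row] = []
    for row in rows:
        # 1. Remove empty lines: skip rows with missing or whitespace-only text.
        text = row.get(text_key)
        if not isinstance(text, str) or not text.strip():
            continue

        new_row: Row = {}
        metadata: Dict[str, Any] = {}
        for key, value in row.items():
            # 2. Remove "errors" columns.
            if key.endswith(error_suffix):
                continue
            # 3. Gather all metadata into the same column, keyed without the prefix.
            if key.startswith(metadata_prefix):
                metadata[key[len(metadata_prefix):]] = value
            else:
                new_row[key] = value

        new_row["metadata"] = metadata
        cleaned.append(new_row)
    return cleaned


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "metadata_url": "https://example.com", "metadata_url_error": None},
        {"text": "   ", "metadata_url": "https://example.org"},
    ]
    print(clean_rows(sample))
```

On the real dataset the same logic would presumably be applied through the dataset library's own filter, map and column-removal operations rather than a Python loop; the loop is only meant to make the intended result of each sub-step explicit.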