Create Dataset with metadata

Steps:
- [x] pseudo-crawl ~10% of the C4 web pages from Common Crawl @tianjianjiang
- [x] import pseudo crawled dataset on JZ @SaulLu 
- [x] run 1st step of extraction:
  1. Extract text, HTML head sections, HTML footer sections, HTML title sections, and HTML metadata @SaulLu
  2. Change format of URL @SaulLu 
  3. Extract Timestamp @cccntu @SaulLu 
  4. Extract Generation Length Sentence @chkla @SaulLu 
  5. Extract Generation Length Text  @chkla @SaulLu 
  6. Extract Data source  @chkla @SaulLu 
- [x] run 2nd step of extraction:
  1. Extract Website descriptions @shanyas10 @SaulLu 
- [x] run 3rd step of extraction:
  1. Extract Entities @manandey @SaulLu 
  2. (option) Extract Entities descriptions @manandey @SaulLu 
- [x] run 4th step of extraction: 
  1. Extract Paragraph @tianjianjiang @SaulLu 
      * #114 
      * #125
      * annotator (preprocessor) of the metadata
  2. Modify entities metadata with paragraph information @manandey @SaulLu 
  3. Modify generation length with paragraph information @chkla @SaulLu
- [ ] (optional) clean final dataset:
  1. Remove empty lines @SaulLu 
  2. Remove "errors" columns @SaulLu 
  3. (optional) Gather all metadata into same column @cccntu @timoschick @SaulLu 
- [ ] push dataset to Hub @SaulLu 
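
The optional cleaning step above could look roughly like the sketch below. It is only an illustration under assumptions, not the project's actual code: it works on plain Python dicts rather than a real dataset object, it reads "empty lines" as rows whose text is empty or whitespace-only, and the column names (`text`, the `metadata_` prefix, the `_error` suffix, the merged `metadata` column) are hypothetical.

```python
from typing import Any, Dict, List


def clean_rows(
    rows: List[Dict[str, Any]],
    text_key: str = "text",             # hypothetical name of the text column
    metadata_prefix: str = "metadata_",  # hypothetical prefix of per-extractor columns
    error_suffix: str = "_error",        # hypothetical suffix of "errors" columns
) -> List[Dict[str, Any]]:
    """Drop empty rows, drop error columns, merge metadata columns into one."""
    cleaned = []
    for row in rows:
        text = row.get(text_key)
        # 1. Remove rows with no usable text (assumed meaning of "empty lines").
        if not text or not text.strip():
            continue

        new_row: Dict[str, Any] = {text_key: text}
        merged: List[Any] = []
        for key, value in row.items():
            if key == text_key:
                continue
            # 2. Remove "errors" columns.
            if key.endswith(error_suffix):
                continue
            # 3. Gather every metadata column into a single list.
            if key.startswith(metadata_prefix):
                merged.extend(value or [])
            else:
                new_row[key] = value
        new_row["metadata"] = merged
        cleaned.append(new_row)
    return cleaned
```

Merging into a single list keeps the row schema stable regardless of which extraction steps were run, at the cost of having to filter by key when only one kind of metadata is needed.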


Create Dataset with metadata #124
