AWS Free Datasets: Part 1
Jan 24, 2024 by mtxvp
The AWS Open Data Sponsorship Program hosts a number of datasets on S3 that are free to download and use in your next ML or data science project.
It is always good practice to check each dataset's license and related documentation to determine whether the dataset may be used for your application.
Let's look at 4 datasets out of that list:
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages.
- Licence: Common Crawl Terms of Use
- Update frequency: monthly
- Documentation
Amazon Berkeley Objects Dataset
A collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. 8,222 listings come with turntable photography (also referred to as "spin" or "360º-View" images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3D models, as glTF 2.0 files.
- Licence: CC BY-NC 4.0
- Update frequency: not updated
- Documentation
BodyM Dataset
A large public body measurement dataset covering 2,505 real subjects. Silhouette images of each subject are paired with height, weight, gender and 14 body measurements.
- Licence: Creative Commons Attribution-Non Commercial 4.0 International Public License
- Update frequency: not updated
- Documentation
Google Books Ngrams
N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
- Licence: Creative Commons Attribution 3.0 Unported License
- Update frequency: not updated
- Documentation
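
To make the sliding-window idea behind n-grams concrete, here is a minimal Python sketch. It is only an illustration, not the pipeline that produced the Google Books dataset: the `ngrams` helper is a hypothetical name, and splitting on whitespace is a simplifying assumption standing in for real tokenization.

```python
from typing import Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens, sliding one token at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # The last valid window starts at index len(tokens) - n.
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Assumption: whitespace splitting is good enough for this toy sentence.
sentence = "the quick brown fox jumps over the lazy dog"
tokens = sentence.split()

# Print every 3-gram of the sentence, one per line.
for gram in ngrams(tokens, 3):
    print(" ".join(gram))
```

Each step of the window adds one new token and drops the oldest one, so a text of k tokens yields k - n + 1 n-grams (and none at all when the text is shorter than n).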