AWS Free Datasets: Part 1
Jan 24, 2024 | mtxvp
The AWS Open Data Sponsorship Program hosts a number of datasets on S3 that are free to download and use in your next ML or data science project.
It is always good practice to check a dataset's license and related documentation to determine whether it may be used for your application.
Let's look at four datasets from that list:
Common Crawl
A corpus of web crawl data composed of over 50 billion web pages.
- Licence: Common Crawl Terms of Use
- Update frequency: monthly
- Documentation
Amazon Berkeley Objects Dataset
A collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. Of these, 8,222 listings come with turntable photography (also referred to as "spin" or "360º-View" images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3D models as glTF 2.0 files.
- Licence: CC BY-NC 4.0
- Update frequency: not updated
- Documentation
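Since these datasets live in S3 buckets, a quick way to see what one contains is to list a few keys before downloading anything. Below is a minimal sketch, assuming boto3 is installed and that the bucket allows anonymous (unsigned) requests. The bucket name `amazon-berkeley-objects` and the helper names `split_s3_uri` and `list_public_prefix` are assumptions for illustration; check the dataset's documentation for the actual location and access rules, as some open datasets require signed requests.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def list_public_prefix(uri, max_keys=10):
    """Return up to max_keys sub-prefixes and object keys under an S3 URI."""
    # Imported here so the URI helper above works without boto3 installed.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    bucket, prefix = split_s3_uri(uri)
    # UNSIGNED skips request signing, i.e. anonymous access.
    # This only works if the bucket owner permits it.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    response = s3.list_objects_v2(
        Bucket=bucket, Prefix=prefix, Delimiter="/", MaxKeys=max_keys
    )
    folders = [p["Prefix"] for p in response.get("CommonPrefixes", [])]
    files = [obj["Key"] for obj in response.get("Contents", [])]
    return folders + files
```

Calling `list_public_prefix("s3://amazon-berkeley-objects/")` would then return the top-level "folders" and files, provided the assumed bucket name is right. Listing with a small `max_keys` first is a cheap way to gauge the layout of a dataset before pulling gigabytes of images.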
BodyM Dataset
A large public body measurement dataset covering 2,505 real subjects. Each subject's silhouette images are paired with height, weight, gender, and 14 body measurements, among other fields.
- Licence: Creative Commons Attribution-Non Commercial 4.0 International Public License
- Update frequency: not updated
- Documentation
Google Books Ngrams
N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.
- Licence: Creative Commons Attribution 3.0 Unported License
- Update frequency: not updated
- Documentation
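To make the sliding-window idea concrete, here is a small sketch of how word n-grams can be extracted and counted. It is not the pipeline used to build the dataset: the `ngrams` helper, the sample sentence, and the naive whitespace tokenization are all assumptions for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield each run of n consecutive tokens as a tuple."""
    # The window starts at every position that still leaves n tokens.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


text = "the quick brown fox jumps over the quick brown dog"
tokens = text.split()  # naive whitespace tokenization (an assumption)

# Count how often each 2-gram occurs in the sample text.
bigram_counts = Counter(ngrams(tokens, 2))

# Collect every 5-gram; a 10-token text yields 10 - 5 + 1 = 6 of them.
five_grams = list(ngrams(tokens, 5))
```

Each step of the window moves forward by one token, so a text of k tokens produces k - n + 1 n-grams. Counting the resulting tuples, as `bigram_counts` does, gives the kind of frequency data an n-gram corpus is typically used for.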