AWS Free Datasets: Part 1 by Max Vapor Pump (MTXVP)

Jan 24, 2024 | mtxvp

dataset opendata

The AWS Open Data Sponsorship Program hosts a number of datasets on S3 that are free to download and use in your next ML or data science project.

It is always good practice to check a dataset's license and related documentation to determine whether the dataset may be used for your application.
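Because the datasets live on S3, a file from a publicly readable bucket can often be fetched over plain HTTPS without AWS credentials. The sketch below assumes the bucket allows anonymous reads and is reachable through the global `s3.amazonaws.com` endpoint; some buckets are region-specific or requester-pays, in which case this will not work as written. The helper names and the bucket and key shown are hypothetical placeholders, so take the real ones from each dataset's documentation.

```python
import shutil
import urllib.request
from urllib.parse import quote


def public_s3_url(bucket: str, key: str) -> str:
    """Build a virtual-hosted-style HTTPS URL for an object in a public bucket."""
    # quote() percent-encodes unsafe characters but keeps the "/" separators.
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def download(bucket: str, key: str, dest: str) -> None:
    """Stream one public object to a local file (assumes anonymous reads are allowed)."""
    url = public_s3_url(bucket, key)
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)


# Hypothetical bucket and key, used only to show the URL shape; nothing is downloaded here.
url = public_s3_url("example-open-data-bucket", "path/to/file with spaces.csv")
print(url)
```

For large datasets, check the documentation for an index or manifest file first and download only the objects you need rather than the whole bucket.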

Let's look at four datasets from that list:

📓 Common Crawl

A corpus of web crawl data composed of over 50 billion web pages.

Licence: Common Crawl Terms of Use
Update frequency: monthly
Documentation

📕 Amazon Berkeley Objects Dataset

A collection of 147,702 product listings with multilingual metadata and 398,212 unique catalog images. 8,222 listings come with turntable photography (also referred to as "spin" or "360º-View" images), as sequences of 24 or 72 images, for a total of 586,584 images in 8,209 unique sequences. For 7,953 products, the collection also provides high-quality 3D models as glTF 2.0 files.

Licence: CC BY-NC 4.0
Update frequency: not updated
Documentation

📘 BodyM Dataset

A large public body measurement dataset covering 2,505 real subjects. Each subject is paired with height, weight, gender, silhouette images, and 14 body measurements.

Licence: Creative Commons Attribution-Non Commercial 4.0 International Public License
Update frequency: not updated
Documentation

📗 Google Books Ngrams

N-grams are fixed-size tuples of items. In this case the items are words extracted from the Google Books corpus. The n specifies the number of elements in the tuple, so a 5-gram contains five words or characters. The n-grams in this dataset were produced by passing a sliding window over the text of books and outputting a record for each new token.

Licence: Creative Commons Attribution 3.0 Unported License
Update frequency: not updated
Documentation
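
The sliding-window construction described for Google Books Ngrams can be illustrated with a few lines of Python. This is a minimal sketch, not the pipeline used to build the dataset: the `ngrams` helper is a hypothetical name, the sample sentence is made up, and splitting on whitespace is a simplifying assumption about tokenization.

```python
from collections import Counter
from typing import Iterator, List, Tuple


def ngrams(tokens: List[str], n: int) -> Iterator[Tuple[str, ...]]:
    """Yield every run of n consecutive tokens, sliding one token at a time."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of k tokens yields k - n + 1 windows (none if k < n).
    for start in range(len(tokens) - n + 1):
        yield tuple(tokens[start:start + n])


# Whitespace tokenization is a simplification used for illustration only.
tokens = "the quick brown fox jumps over the quick brown dog".split()

# Count how often each 2-gram occurs in the sample text.
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```

Changing the second argument to 5 produces 5-grams from the same token list; the window still advances by one token per step.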