
Etsy dataset

release 1.0 (Aug 22, 2016)

The dataset contains product metadata from the online marketplace Etsy, including titles, descriptions, tags, materials, and image URLs. It covers 2.8 million product listings sold in September 2014.

The dataset may be used for academic research only.

Files

Categories

categories.json (1.9M)

Product categories are organized in a tree structure. The JSON file contains a list of the nodes in the category tree. Each node looks like the following.

{
  "category_id": 68887312,
  "name": "art",
  "meta_title": "Fine Art on Etsy - Original fine art, prints, sculpture ",
  "meta_keywords": "handmade art, handcrafted art, folk art, arts and crafts, fine art, painting, original art, sculpture, prints, art reproductions, collectible art, ACEO",
  "meta_description": "Shop for unique, original fine art directly from independent artists on Etsy, a global handmade marketplace. Browse paintings, sculpture, digital prints, reproductions & more.",
  "page_description": "Shop for unique, original fine art from our artisan community",
  "page_title": "Fine Art",
  "category_name": "art",
  "short_name": "Art",
  "long_name": "Art",
  "num_children": 12,
  "parent": null,
  "children": [
    68888532,
    68888582,
    68890894,
    68890908,
    68891802,
    68891842,
    68891942,
    68891952,
    68892154,
    69154725,
    69154867,
    69154933
  ]
}

A category node contains the following notable fields.

  • name: name of the category.
  • category_id: id of the category node.
  • parent: id of the parent node (null for a top-level category, as in the example above).
  • children: list of the child node ids.
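Because each node stores only the id of its parent and the ids of its children, walking the tree is easier with a lookup table keyed by category_id. The following is a minimal sketch under that assumption. The helper names index_categories and path_to_root are hypothetical, and the child node's name is made up for illustration; in practice the node list would come from json.load on categories.json.

```python
def index_categories(nodes):
    """Map category_id -> node for a list of category nodes."""
    return {node["category_id"]: node for node in nodes}


def path_to_root(category_id, by_id):
    """Return category names from the top-level ancestor down to the given node."""
    names = []
    current = by_id.get(category_id)
    while current is not None:
        names.append(current["name"])
        parent_id = current["parent"]
        # A null parent (None in Python) marks a top-level category.
        current = by_id.get(parent_id) if parent_id is not None else None
    return names[::-1]


# Tiny stand-in for the contents of categories.json.
# "example_child" is a hypothetical name, not taken from the dataset.
nodes = [
    {"category_id": 68887312, "name": "art", "parent": None,
     "children": [68888532]},
    {"category_id": 68888532, "name": "example_child", "parent": 68887312,
     "children": []},
]
by_id = index_categories(nodes)
print(path_to_root(68888532, by_id))
```

The same lookup table can be used to descend the tree by following the children ids of a node.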

Listings

listings.tgz (5.5GB, MD5: b5de9f7d8db4a9b554881d53c17a9ef0)

The initial version of the file had MD5: 7ff1106e165cabdd5a21d62d0985f88e. However, due to a storage crash on our server in Sep 2016, the current version of the file has a different MD5.
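To check a downloaded archive against the MD5 stated above, the digest can be computed in chunks so that the 5.5GB file is never loaded into memory at once. This is a sketch using Python's standard hashlib module; the function names file_md5 and verify_md5 are hypothetical.

```python
import hashlib

# MD5 of the current version of listings.tgz, as stated in this README.
EXPECTED_MD5 = "b5de9f7d8db4a9b554881d53c17a9ef0"


def file_md5(path, chunk_size=1 << 20):
    """Compute the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected=EXPECTED_MD5):
    """Return True if the file at `path` has the expected MD5 digest."""
    return file_md5(path) == expected.lower()
```

Calling verify_md5("listings.tgz") after the download should return True for the current version of the file.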

Extract the listings with the tar command.

tar xzf listings.tgz

The archive file contains JSON files, each containing up to 10,000 item listings.

listings/*.json

An example is shown below.

{
  "listing_id": 105604447,
  "state": "active",
  "user_id": 14182146,
  "category_id": 69188171,
  "title": "Red flora decoupage case for Blackberry 9700 / Blackberry9780 made to order / custom order / hard case / cover case / phone accessories",
  "description": "This's Decoupage handmade for iPphone 5 / iPhone 4s/ Blackberry 9700 / Blackberry9780 / \r\n\r\nSamsung Galaxy S III making form paper napkin 100% scratch proof and water proof.\r\n\r\n- Fits for all version of iPhone 4 / 4s Blackberry 9700 / Blackberry9780 / \r\nSamsung Galaxy S III (including Verizon, Sprint, AT&T and international iPhones)\r\n\r\n- Made one at a time using a technique similar to decoupage\r\n\r\n- We are made to order for Iphone4 / Iphone 4s/ Blackberry 9700 / Blackberry9780 / \r\nSamsung Galaxy S III.\r\n\r\nPayment & Shipping\r\n\r\n★ Every item will be send after your payment 2-3 days \r\n\r\n★ You should received in 2-4 weeks\r\n\r\n★ We send by airmail for all items.\r\n\r\n(If full order price reach US: $150 ,We offer free up grade shipping by EMS which is very fast and safe only take about 2-3 days)\r\n\r\n★ Please contact us, if you have any question.\r\n\r\nThank you for your interested",
  "creation_tsz": 1407565559,
  "ending_tsz": 1418109959,
  "original_creation_tsz": 1343619604,
  "last_modified_tsz": 1407565559,
  "price": "17.50",
  "currency_code": "USD",
  "quantity": 1,
  "tags": [
    "Decoupage",
    "Decoupage case",
    "Blackberry9700",
    "Vintage",
    "cover",
    "case",
    "napkin",
    "paper napkin",
    "cell phone",
    "Thai style",
    "Blackberry case",
    "summer case",
    "spring case"
  ],
  "category_path": [
    "Accessories",
    "Case",
    "Cell Phone"
  ],
  "category_path_ids": [
    69150467,
    68892196,
    69188171
  ],
  "materials": [
    "plastic case",
    "paper napkin"
  ],
  "shop_section_id": 10700190,
  "featured_rank": null,
  "state_tsz": 1407507029,
  "url": "https://www.etsy.com/listing/105604447/red-flora-decoupage-case-for-blackberry?utm_source=smartrecommender&utm_medium=api&utm_campaign=api",
  "views": 212,
  "num_favorers": 5,
  "shipping_template_id": null,
  "processing_min": null,
  "processing_max": null,
  "who_made": "i_did",
  "is_supply": "false",
  "when_made": "2010_2014",
  "is_private": false,
  "recipient": null,
  "occasion": null,
  "style": null,
  "non_taxable": false,
  "is_customizable": false,
  "is_digital": false,
  "file_data": "",
  "language": "en-US",
  "has_variations": false,
  "used_manufacturer": false,
  "Images": [
    {
      "listing_image_id": 361017843,
      "hex_code": "927F7A",
      "red": 146,
      "green": 127,
      "blue": 122,
      "hue": 12,
      "saturation": 16,
      "brightness": 57,
      "is_black_and_white": false,
      "creation_tsz": 1343619605,
      "listing_id": 105604447,
      "rank": 1,
      "url_75x75": "https://img1.etsystatic.com/002/0/6268251/il_75x75.361017843_8e5e.jpg",
      "url_170x135": "https://img1.etsystatic.com/002/0/6268251/il_170x135.361017843_8e5e.jpg",
      "url_570xN": "https://img1.etsystatic.com/002/0/6268251/il_570xN.361017843_8e5e.jpg",
      "url_fullxfull": "https://img1.etsystatic.com/002/0/6268251/il_fullxfull.361017843_8e5e.jpg",
      "full_height": 428,
      "full_width": 570
    }
  ]
}

The listing item contains various fields. The notable fields are the following.

  • listing_id: id of the item.
  • category_id: category id of the item. Note that this is the item's primary category, and that category node may itself have a parent category. The category_path_ids field contains all the associated categories including ancestors, and the category_path field contains the names along the path.
  • tags: tags associated with the item.
  • title: title of the item.
  • description: description of the item.
  • user_id: id of the user (shop owner).
  • Images: array of images. Note that one item might contain multiple images. Our ECCV 2016 paper uses only the first image in the list.
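As an illustration of the fields above, the following sketch pulls the category path and the first image URL out of a listing record. The function name summarize_listing is hypothetical, and the handling of a missing or empty Images array is an assumption rather than something the dataset guarantees. The sample record is an abbreviated copy of the example listing shown earlier.

```python
def summarize_listing(listing):
    """Collect the notable fields of a listing into a small dict."""
    # Assumption: Images may be absent or empty for some listings.
    images = listing.get("Images") or []
    first_image = images[0]["url_570xN"] if images else None
    return {
        "listing_id": listing["listing_id"],
        "category_id": listing["category_id"],
        "category_path": " > ".join(listing.get("category_path") or []),
        "num_tags": len(listing.get("tags") or []),
        "first_image_url": first_image,
    }


# Abbreviated copy of the example listing.
example = {
    "listing_id": 105604447,
    "category_id": 69188171,
    "tags": ["Decoupage", "Decoupage case", "Blackberry9700"],
    "category_path": ["Accessories", "Case", "Cell Phone"],
    "category_path_ids": [69150467, 68892196, 69188171],
    "Images": [
        {"rank": 1,
         "url_570xN": "https://img1.etsystatic.com/002/0/6268251/il_570xN.361017843_8e5e.jpg"},
    ],
}
summary = summarize_listing(example)
print(summary["category_path"])
print(summary["first_image_url"])
```

Taking only images[0] mirrors the choice, noted above, of using the first image in the list.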

Experimental configuration (Vittayakorn et al. 2016)

config-eccv2016.json.gz (40M)

The file contains the experimental configuration used in Vittayakorn et al. 2016.

The JSON file contains the following data structure.

  • listing_ids: Listing ids from the clothing category used in the paper.
  • train, test, val: Train/test/validation split of the dataset in the experiment. Each contains the following.
    • listing_ids: Listing ids of the split.
    • weight: Parameter used to make train/test/val splits. Can be ignored.
  • words: List of the top 250 adjectives, in descending order of frequency. Each object contains the following fields.
    • word: the adjective.
    • count: total number of occurrences of the word in the dataset.
    • train, test, val: Splits of the dataset, each containing the following.
      • positive_ids: Listing ids containing the given adjective, up to 10,000.
      • negative_ids: Listing ids NOT containing the given adjective, up to 10,000.
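The structure above can be read with Python's standard gzip and json modules. The sketch below assumes the top level of the file is a JSON object with the keys listed above; the function names load_config and word_split_ids are hypothetical, and the ids in demo_config are made-up placeholders, not values from the dataset.

```python
import gzip
import json


def load_config(path="config-eccv2016.json.gz"):
    """Load the gzip-compressed JSON configuration file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def word_split_ids(config, word, split="train"):
    """Return (positive_ids, negative_ids) for one adjective and one split."""
    for entry in config["words"]:
        if entry["word"] == word:
            return entry[split]["positive_ids"], entry[split]["negative_ids"]
    raise KeyError(word)


# Miniature stand-in with the same layout; all ids are placeholders.
demo_config = {
    "listing_ids": [1, 2, 3, 4],
    "train": {"listing_ids": [1, 2], "weight": 0.0},
    "test": {"listing_ids": [3], "weight": 0.0},
    "val": {"listing_ids": [4], "weight": 0.0},
    "words": [
        {"word": "red", "count": 2,
         "train": {"positive_ids": [1], "negative_ids": [2]},
         "test": {"positive_ids": [3], "negative_ids": []},
         "val": {"positive_ids": [], "negative_ids": [4]}},
    ],
}
positive, negative = word_split_ids(demo_config, "red", "train")
print(positive, negative)
```

With the real file, the dict returned by load_config() would be passed in place of demo_config.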

Usage

Loading categories and listings

The dataset has 2.8 million listings, so we recommend storing them in a key-value database (e.g., LMDB) rather than keeping them all in memory.

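import json
from glob import glob

# Load the full category tree (a list of category nodes).
with open('categories.json', 'r') as f:
    categories = json.load(f)

# Load the listing files one at a time; each holds up to 10,000 listings.
for json_file in sorted(glob('listings/*.json')):
    with open(json_file, 'r') as f:
        listings = json.load(f)
    # Process or store `listings` here; it is overwritten on the next iteration.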
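The key-value recommendation can be sketched as follows. To stay within the Python standard library, this example swaps in sqlite3 as the store instead of LMDB; the idea of keying each record by listing_id is the same. It assumes each file under listings/ is a JSON array of listing objects, which should be checked against the extracted data. The function names build_store and get_listing are hypothetical.

```python
import json
import sqlite3
from glob import glob


def build_store(db_path, pattern="listings/*.json"):
    """Store every listing as a JSON string keyed by its listing_id."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings ("
        "listing_id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    for json_file in sorted(glob(pattern)):
        with open(json_file, "r", encoding="utf-8") as f:
            listings = json.load(f)  # assumed: a JSON array of listings
        conn.executemany(
            "INSERT OR REPLACE INTO listings VALUES (?, ?)",
            ((item["listing_id"], json.dumps(item)) for item in listings),
        )
        conn.commit()  # commit per file to bound the transaction size
    return conn


def get_listing(conn, listing_id):
    """Fetch one listing by id, or None if it is not in the store."""
    row = conn.execute(
        "SELECT body FROM listings WHERE listing_id = ?", (listing_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None
```

A typical use would be conn = build_store("listings.sqlite3") once, followed by get_listing(conn, 105604447) for lookups. Only one file's worth of listings is held in memory at a time.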

Citation

Please cite one or both of the following if you use the dataset in your work.

Automatic Attribute Discovery with Neural Activations
Sirion Vittayakorn, Takayuki Umeda, Kazuhiko Murasaki, Kyoko Sudo, Takayuki Okatani, Kota Yamaguchi
ECCV 2016
arXiv

Learning to Describe E-Commerce Images from Noisy Online Data
Takuya Yashima, Naoaki Okazaki, Kentaro Inui, Kota Yamaguchi, Takayuki Okatani
ACCV 2016