Data transformation is not taken into account in AWS Glue
I have an S3 bucket with folders that contain files. I want to build a database so that I can query these documents by several keys through a Lambda-based API. To do that, I first need to normalize the data. For example, I need to transform every file in the /jomalone/ folder from this:
{
"data": {
"products": {
"items": [
{
"default_category": {
"id": "25956",
"value": "Bath & Body"
},
"description": "London's Covent Garden early morning market. Succulent nectarine, peach and cassis and delicate spring flowers melt into the note of acacia honey. Sweet and delightfully playful. Our luxuriously rich Body Crème with its conditioning oils of jojoba seed, cocoa seed and sweet almond, help to hydrate, nourish and protect the skin, while delicious signature fragrances leave your body scented all over.",
"display_name": "Nectarine Blossom & Honey Body Crème",
"is_hazmat": false,
"meta": {
"description": "The Jo Malone™ Nectarine Blossom & Honey Body Crème leaves skin beautifully scented with fruity notes of nectarine and peach sweetened with acacia honey."
},
"product_badge": null,
"product_id": "10024",
"product_url": "/product/25956/10024/bath-body/nectarine-blossom-honey-body-creme",
"short_description": "A caring Body Crème infused with the succulent scent of Nectarine Blossom & Honey.",
"tags": {
"total": 2,
"items": [
{
"id": "25956",
"value": "Bath & Body",
"key": "bath-body"
},
{
"id": "26087",
"value": "Nectarine Blossom & Honey Scent",
"key": "nectarine-blossom-honey-scent"
}
]
},
"cross_sell": [
{
"sku_id": "L0Y401",
"sort_key": 1
},
{
"sku_id": "L0YF01",
"sort_key": 2
},
{
"sku_id": "L01G01",
"sort_key": 3
},
{
"sku_id": "L8CC01",
"sort_key": 4
},
{
"sku_id": "L8CA01",
"sort_key": 5
},
{
"sku_id": "L7XW01",
"sort_key": 6
}
],
"maincat": [
{
"key": "bathbody-maincat",
"value": "bathbody_maincat"
}
],
"subcat": [
{
"key": "bodycare-subcat",
"value": "bodycare_subcat"
}
],
"media": null,
"reviews": {
"average_rating": null,
"number_of_reviews": null
},
"usage": [
{
"content": "Take a generous amount of our luxurious Body Crème and massage into skin.",
"label": "HOW DOES IT WORK",
"type": "how_does_it_work"
}
],
"fragrance_family": [
{
"key": "fruity-fragrance",
"value": "fruity_fragrance"
}
],
"style": [
{
"key": "decadent-style",
"value": "decadent_style"
}
],
"mood": [
{
"key": "cosy-mood",
"value": "cosy_mood"
}
],
"skus": {
"total": 1,
"items": [
{
"is_default_sku": false,
"is_discountable": true,
"is_giftwrap": false,
"is_under_weight_hazmat": false,
"iln_listing": "Ingredients: Water\\Aqua\\Eau, Glycerin, Cetearyl Alcohol, Simmondsia Chinensis (Jojoba) Seed Oil, Fragrance (Parfum), Glyceryl Stearate, Stearic Acid, Triethanolamine, Theobroma Cacao (Cocoa) Seed Butter, Prunus Amygdalus Dulcis (Sweet Almond) Oil, Isopropyl Palmitate, Dimethicone, Aloe Barbadensis Leaf Juice, Bisabolol, Caffeine, Cocamidopropyl Pg-Dimonium Chloride Phosphate, Glyceryl Laurate, Hexylene Glycol, Caprylyl Glycol, Disodium Edta, Citral, Limonene, Citronellol, Linalool, Phenoxyethanol <ILN47239>",
"iln_version_number": "ILN47239",
"inventory_status": "Active",
"material_code": "L4P8010000",
"prices": [
{
"currency": "EUR",
"is_discounted": false,
"include_tax": {
"price": 68,
"original_price": 68,
"price_per_unit": 38.86,
"price_formatted": "€68.00",
"original_price_formatted": "€68.00",
"price_per_unit_formatted": "€38.86 / 100ML"
}
}
],
"sizes": [
{
"value": "175ML",
"key": 1
}
],
"shades": [
{
"name": "",
"description": "",
"hex_val": ""
}
],
"sku_id": "L4P801",
"sku_badge": null,
"unit_size_formatted": "100ML",
"upc": "690251040254",
"is_engravable": null,
"perlgem": {
"SKU_BASE_ID": 63584
},
"media": {
"large": [
{
"src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_0.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 1000,
"width": 1000
},
{
"src": "/media/export/cms/products/1000x1000/jo_sku_L4P801_1000x1000_1.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 1000,
"width": 1000
}
],
"medium": [
{
"src": "/media/export/cms/products/670x670/jo_sku_L4P801_670x670_0.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 670,
"width": 670
}
],
"small": [
{
"src": "/media/export/cms/products/100x100/jo_sku_L4P801_100x100_0.png",
"alt": "Nectarine Blossom & Honey Body Crème",
"height": 100,
"width": 100
}
]
},
"collection": null,
"recipient": [
{
"key": "mom-recipient",
"value": "mom_recipient"
},
{
"key": "bride-recipient",
"value": "bride_recipient"
},
{
"key": "host-recipient",
"value": "host_recipient"
},
{
"key": "me-recipient",
"value": "me_recipient"
},
{
"key": "her-recipient",
"value": "her_recipient"
}
],
"occasion": [
{
"key": "thankyou-occasion",
"value": "thankyou_occasion"
},
{
"key": "birthday-occasion",
"value": "birthday_occasion"
},
{
"key": "treat-occasion",
"value": "treat_occasion"
}
],
"location": [
{
"key": "bathroom-location",
"value": "bathroom_location"
}
]
}
]
}
}
]
}
}
}
into JSON with the following schema:
brandName String
productName String
productLink String
productType ?
maleFemale Male/Female
price float
unitPrice String
size float
ingredients String
notes String
numReviews Int
userIDs float
locations float
dates Date
ages int
sexes M/F
ratings Int
reviews Array of String
sources String
characteristics String
specificRatings String
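To make the target shape concrete, here is a minimal sketch in plain Python (not Glue) of how one entry of `data.products.items` could be flattened into part of that schema. The field correspondences (for example `display_name` to `productName`, or `iln_listing` to `ingredients`) are assumptions read off the sample document, `normalize_item` is a hypothetical helper name, and the brand name is passed in by the caller because the sample does not contain it.

```python
from typing import Any, Dict


def normalize_item(item: Dict[str, Any], brand_name: str) -> Dict[str, Any]:
    """Flatten one product entry into a record (assumed field mapping)."""
    # Use the first SKU and its first price entry, if they exist.
    skus = (item.get("skus") or {}).get("items") or []
    sku = skus[0] if skus else {}
    prices = sku.get("prices") or []
    tax = (prices[0].get("include_tax") or {}) if prices else {}
    reviews = item.get("reviews") or {}

    price = tax.get("price")
    return {
        "brandName": brand_name,  # assumption: supplied by the caller
        "productName": item.get("display_name"),
        "productLink": item.get("product_url"),
        "price": float(price) if price is not None else None,
        "unitPrice": tax.get("price_per_unit_formatted"),
        "ingredients": sku.get("iln_listing"),
        "numReviews": reviews.get("number_of_reviews"),
    }
```

Fields of the schema that have no obvious counterpart in the sample (such as `userIDs`, `ages` or `ratings`) are deliberately left out of the sketch rather than guessed.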
So I tried AWS Glue, but I don't know how to get rid of the nested wrapper keys at the top of each document:
"data": {
"products": {
"items": [
...
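For illustration only, stripping that wrapper outside of Glue could be sketched in plain Python as below. `iter_items` is a hypothetical helper, and the one-line sample document is a shortened stand-in for the real files; this shows the intended result, not how Glue itself does it.

```python
import json
from typing import Any, Dict, Iterator


def iter_items(document: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield each product under data.products.items; yield nothing if a level is missing or null."""
    node: Any = document
    for key in ("data", "products", "items"):
        if not isinstance(node, dict):
            return
        node = node.get(key)
    if isinstance(node, list):
        yield from node


# Shortened stand-in for one of the files in the bucket.
raw = '{"data": {"products": {"items": [{"product_id": "10024"}]}}}'
flat = list(iter_items(json.loads(raw)))
print(flat)
```

Each yielded item is a product object without the `data`, `products` and `items` levels around it.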
I tried modifying the field names in the ApplyMapping node.
But judging by the Preview tab, this does not have the effect I was looking for.
I removed the first and last highlighted fields and changed the others, but none of that seems to be reflected in the preview.
In fact, the script generated from the visual job does not even seem to contain the mapping:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
format_options={"multiline": False},
connection_type="s3",
format="json",
connection_options={"paths": ["s3://datahubpredicity/JoMalone/"], "recurse": True},
transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
frame=S3bucket_node1,
mappings=[("data.products.items", "array", "data.products.items", "array")],
transformation_ctx="ApplyMapping_node2",
)
# Script generated for node S3 bucket
S3bucket_node3 = glueContext.write_dynamic_frame.from_options(
frame=ApplyMapping_node2,
connection_type="s3",
format="json",
connection_options={"path": "s3://datahubpredicity/merged/", "partitionKeys": []},
transformation_ctx="S3bucket_node3",
)
job.commit()

