notebookVB

This commit is contained in:
2026-06-01 07:24:46 +02:00
parent e3522f4017
commit 5073c01692
8 changed files with 1056 additions and 261 deletions
+512
View File
@@ -0,0 +1,512 @@
# Rohlik.cz Scraper — API & Data Notes
Co víme o rohlik.cz scrapingu k 2026-06-01. Tento dokument shrnuje endpointy,
tvary odpovědí, login flow a poznámky pro návrh databáze.
---
## 1. Login / session
### 1.1 API login (bez UI)
Stránka má klasický JSON endpoint, který se chová stejně jako přihlášení přes formulář:
```
POST https://www.rohlik.cz/services/frontend-service/login
Content-Type: application/json
Accept: application/json
{ "email": "...", "password": "..." }
```
Odpověď (status 200):
```json
{
"status": 200,
"messages": [],
"data": {
"status": ...,
"fbLoginUrl": "...",
"uniqsid": "...",
"user": { ... },
"address": { ... },
"zoneId": ...,
"availableStores": [...],
"features": [...],
"deliveryPoint": { ... },
"segment": "...",
"personalizationConsent": ...,
"newUserCreated": false,
"session": { ... },
"admin": false,
"isAuthenticated": true,
"store": ...,
"isAdmin": false
}
}
```
Po úspěšném loginu sedí v contextu cookie `PHPSESSION` na `.rohlik.cz`,
která drží přihlášení pro všechny další API calls.
### 1.2 Cloudflare cookies
První GET na `https://www.rohlik.cz` vyrobí cookie `cf_clearance` (Cloudflare
challenge JS běží automaticky v headful Playwrightu). Bez ní login API
nereaguje. Proto skript **nejdřív** otevře homepage, **pak** posílá login POST.
### 1.3 Cookie consent banner (Usercentrics)
- Banner se renderuje přes web-component `#usercentrics-cmp-ui` se shadow DOM.
- Pokus o klikání přes DOM selektory zvenku **nefunguje** — shadow root blokuje pointer events i pro elementy pod ním.
- Funkční cesta: oficiální JS API
```js
await window.UC_UI.acceptAllConsents();
await window.UC_UI.closeCMP();
```
- Banner mizí s ~1 s animací, takže po close je potřeba `wait_for_selector('#usercentrics-cmp-ui', state='detached')`.
- Souhlas se ukládá do `localStorage` (klíče `uc_user_interaction`, `uc_settings`, …) + cookie `consentTracked=true`.
### 1.4 Reuse: `auth_state.json`
`context.storage_state(path=...)` uloží cookies + localStorage. Při příštím
běhu se to nahraje přes `browser.new_context(storage_state=...)` a uživatel je:
- už přihlášený (login API se neopakuje),
- už má souhlas s cookies (banner se vůbec nezobrazí).
Implementace flow viz `test_login.py::ensure_logged_in()`:
1. načti `auth_state.json` pokud existuje,
2. otevři `BASE_URL`, zkontroluj `text="Přihlásit se"` (přítomné → nepřihlášen),
3. když nepřihlášen → `POST /services/frontend-service/login`, accept cookies, ulož state,
4. když přihlášen → rovnou jeď dál.
---
## 2. Kategorie
### 2.1 Hlavní kategorie
```
GET /api/v5/navigation/components/navigation-tabs/categories
```
Vrací list 17 hlavních kategorií. Každá obsahuje:
```json
{
"id": 300102000,
"name": "Ovoce a zelenina",
"link": "/c300102000-ovoce-a-zelenina",
"image": "/images/.../fruits-and-veggies.png",
"imageType": "rich",
...
}
```
Aktuální seznam (k dnešku):
| ID | Název |
|------------|----------------------|
| 300102000 | Ovoce a zelenina |
| 300105000 | Mléčné a chlazené |
| 300103000 | Maso a ryby |
| 300117503 | Grilování |
| 300101000 | Pekárna a cukrárna |
| 300104000 | Uzeniny a lahůdky |
| 300107000 | Mražené |
| 300121429 | Plant Based |
| 300106000 | Trvanlivé |
| 300108000 | Nápoje |
| 300112393 | Speciální výživa |
| 300124206 | Kosmetika |
| 300109000 | Drogerie |
| 300111000 | Domácnost a zahrada |
| 300110000 | Dítě |
| 300112000 | Zvíře |
| 300112985 | Lékárna |
> Hardcoded strom v `categories.py` je zastaralý (chybí Dítě, Zvíře, Lékárna).
> Doporučeno přejít na živé tahání z API.
### 2.2 Subkategorie (rekurzivně)
```
GET /api/v4/navigation/components/navigation-tabs/subcategories?categoryIds=<ID>
```
Vrací **flat list** dětí dané kategorie. Příklad jednoho prvku:
```json
{
"id": 300112001,
"name": "Pes",
"image": "/images/.../1342001-1531397856.jpg",
"imageColor": "var(--green-60)",
"link": "/c300112001-pes",
"imageLink": null,
"imageType": "rich",
"subcategoryIds": [300112002, 300112003, 300112004, 300112008, 300112009, 300118461, 300124184, 300124185]
}
```
Klíčový moment: pole `subcategoryIds` říká, že tento uzel má další děti.
Pro získání těch dětí musíme **opět zavolat stejný endpoint** s tímto ID jako parentem.
#### Rekurzivní algoritmus
```python
def fetch_children(parent_id, visited, depth=1, max_depth=6):
if str(parent_id) in visited or depth > max_depth: return []
visited.add(str(parent_id))
subs = GET /api/v4/.../subcategories?categoryIds={parent_id}
out = []
for s in subs:
node = {id, name, url, children: []}
if s.subcategoryIds:
node.children = fetch_children(s.id, visited, depth+1)
out.append(node)
return out
```
Implementace v `scrape_categories.py`. Výstup uložen v `categories_live.json`
jako `{tree: [{id, name, url, children: [...]}], raw_main: ...}`.
---
## 3. Listing produktů v kategorii
### 3.1 Endpoint
```
GET /api/v1/categories/normal/<categoryId>/products
?page=<N> # 0-based
&size=50 # max items per page
&sort=recommended
&filter=
&excludeProductIds=
```
### 3.2 Odpověď — jen IDs
```json
{
"categoryId": 300102013,
"categoryType": "normal",
"productIds": [1407650, 1354613, 1350461, ...],
"productsWithType": [{"id": 1407650, "type": "PRODUCT"}, ...],
"impressions": [],
"interactiveProductCardAds": [],
"pageable": {
"pageNumber": 0,
"pageSize": 50,
"sort": {...},
"offset": 0,
"unpaged": false,
"paged": true
}
}
```
Listing **nevrací detaily** — jen ID. Detail produktů se musí dotáhnout přes 5 batch endpointů (níže).
### 3.3 Stránkování
- `size=50` se chová jako horní limit; pokud kategorie má méně, vrátí všechno najednou.
- Konec stránek = první stránka, která vrátí prázdný `productIds`, **nebo** stránka s méně než `size` items.
---
## 4. Detail produktů — 5 paralelních batch endpointů
Stránka pro každou sadu ID volá **5 batch endpointů paralelně**, vždy s opakovaným query parametrem `?products=ID1&products=ID2&...`:
```
GET /api/v1/products?products=...
GET /api/v1/products/prices?products=...
GET /api/v1/products/stock?products=...
GET /api/v1/products/categories?products=...
GET /api/v1/products/user-data?products=...
```
> ⚠ `categoryType=normal` parametr stránka taky posílá — bezpečnější ho přidat.
> ⚠ Syntaxe je **opakovaný klíč**, ne čárka. `?products=1&products=2`, ne `?products=1,2`.
> ⚠ Existuje i `/api/v1/products/card?products=...` — listing ho **nepoužívá**. Vyhnout se.
### 4.1 `/api/v1/products` — základní info
```json
[
{
"id": 1407650,
"name": "Čerstvě utrženo Okurka hadovka, bez folie",
"slug": "cerstve-utrzeno-okurka-hadovka-bez-folie",
"mainCategoryId": 300102013,
"unit": "kg",
"textualAmount": "cca 380 g",
"weightedItem": true,
"packageRatio": null,
"brand": null,
"sellerId": 1,
"flag": "cz",
"archived": false,
"premiumOnly": false,
"type": "PRODUCT",
"images": [
"https://cdn.rohlik.cz/images/grocery/products/1407650/1407650-...jpg",
...
],
"countries": [
{ "name": "Česká republika", "nameId": "ceska-republika", "code": "CZ" }
],
"countryOfOriginFlagIcon": "https://cdn.rohlik.cz/images/countryFlags/cz.svg",
"badges": [
{ "type": "freshly-harvested", "title": "Čerstvě sklizeno", "subtitle": null, "tooltip": "" }
],
"filters": [],
"information": [],
"attachments": [],
"image3dData": null,
"adviceForSafeUse": null,
"productStory": null,
"canBeFavorite": true,
"canBeRated": true
}
]
```
| Pole | Typ | Popis |
|--------------------|----------|-------|
| `id` | int | Product ID |
| `name` | string | Plný název |
| `slug` | string | URL slug (`/{slug}-c{id}` nebo přes `/products/{id}-{slug}`) |
| `mainCategoryId` | int | ID kategorie kam patří |
| `unit` | string | "kg" / "ks" / "l" / ... — jednotka ceny |
| `textualAmount` | string | "cca 380 g" / "1 ks" / "500 ml" — pro zobrazení |
| `weightedItem` | bool | true = vážené (variabilní hmotnost), false = kusové |
| `brand` | string? | Značka nebo null |
| `flag` | string? | Země původu kód ("cz", "it", ...) |
| `images` | string[] | URL obrázků (první je hlavní) |
| `countries` | object[] | Strukturovaná země původu |
| `badges` | object[] | Štítky (bio, čerstvě sklizeno, …) |
| `archived` | bool | True = produkt už nabídku opustil |
| `premiumOnly` | bool | Jen pro Xtra členy |
### 4.2 `/api/v1/products/prices` — ceny
Bez slevy:
```json
[
{
"productId": 1407650,
"price": { "amount": 34.16, "currency": "CZK" },
"pricePerUnit": { "amount": 89.9, "currency": "CZK" },
"sales": [],
"lastMinuteTitle": null
}
]
```
Se slevou:
```json
{
"productId": 1437841,
"price": { "amount": 65.69, "currency": "CZK" },
"pricePerUnit": { "amount": 429.9, "currency": "CZK" },
"sales": [
{
"id": 12988802,
"type": "premium", // "premium" / "sale" / ...
"triggerAmount": 1,
"price": { "amount": 55.83, "currency": "CZK" },
"pricePerUnit": { "amount": 365.38, "currency": "CZK" },
"originalPrice": { "amount": 65.69, "currency": "CZK" },
"originalPricePerUnit": null,
"badges": [{ "type": "premium-discount", "title": "-15 %", "subtitle": null }],
"validTill": "2029-01-02T23:59:00+01:00",
"active": true,
"silent": false,
"bundleId": null
}
],
"lastMinuteTitle": null
}
```
| Pole | Cesta | Popis |
|------|-------|-------|
| Cena | `price.amount` | Aktuální cena za balení (Kč) |
| Cena/jednotku | `pricePerUnit.amount` | Cena za `unit` z `/products` |
| Akce | `sales[0].price.amount` | Pokud `sales` neprázdné |
| Typ akce | `sales[0].type` | `premium` (Xtra), `sale`, … |
| Štítek | `sales[0].badges[0].title` | "-10 %", "-15 %", ... |
| Platnost | `sales[0].validTill` | ISO datetime |
### 4.3 `/api/v1/products/stock` — skladovost
```json
[
{
"productId": 1407650,
"warehouseId": 8799,
"packageInfo": { "amount": 0.38, "unit": "kg" },
"inStock": false,
"maxBasketAmount": 0,
"maxBasketAmountReason": "AVAILABLE", // "ALLOWED" když lze koupit
"preorderEnabled": false,
"unavailabilityReason": null,
"deliveryRestriction": null,
"expectedReplenishment": null,
"availabilityDimension": 0,
"shelfLife": null, // { value, unit }
"billablePackaging": null, // záloha (lahve)
"freshness": null,
"premiumOnly": false,
"tooltips": [],
"sales": []
}
]
```
| Pole | Popis |
|------|-------|
| `inStock` | bool — skladem ano/ne |
| `maxBasketAmount` | int — max kusů do košíku |
| `packageInfo.amount` + `.unit` | Reálná hmotnost/objem balení (oproti `textualAmount` z base) |
| `warehouseId` | ID skladu (může se lišit podle adresy) |
| `shelfLife` | Trvanlivost (pokud uvedena) |
| `billablePackaging` | Zálohovaný obal (lahev atd.) |
### 4.4 `/api/v1/products/categories`
```json
[
{
"productId": 1407650,
"categories": [
{ "id": 300102000, "type": "normal", "name": "Ovoce a zelenina", "slug": "ovoce-a-zelenina", "level": 0 },
{ "id": 300102008, "type": "normal", "name": "Zelenina", "slug": "zelenina", "level": 1 },
{ "id": 300102013, "type": "normal", "name": "Okurky, cukety a lilky", "slug": "okurky-cukety-a-lilky", "level": 2 }
]
}
]
```
Plný strom kategorií od kořene k listu, `level=0` = hlavní. Užitečné, protože produkt může patřit do více kategorií (např. „Grilování" duplikuje listy z masa).
### 4.5 `/api/v1/products/user-data`
Per-user data (oblíbené, naposled koupeno…). Pro scraping cen **nepotřebujeme**, ale stránka to volá, takže když to vynecháme, vypadáme méně jako frontend.
---
## 5. Sample merged record
Po zavolání všech 5 endpointů a merge podle `productId`:
```json
{
"productId": 1407650,
"base": { ... pole z /products ... },
"prices": { ... pole z /products/prices ... },
"stock": { ... pole z /products/stock ... },
"categories": { ... pole z /products/categories ... },
"user_data": { ... pole z /products/user-data ... }
}
```
Reálná tabulka prvního leafu (Okurky, cukety a lilky → 17 produktů):
```
ID Skladem Cena Za jedn. Akce Název (balení)
1407650 ne 34.16 89.90/kg Čerstvě utrženo Okurka hadovka (cca 380 g)
1354613 ano 31.87 109.90/kg Okurka polní 1 ks (cca 290 g)
1294911 ano 49.90 49.90/ks 44.91 -10 % BIO Okurka hadovka 1 ks (1 ks)
...
```
---
## 6. Číselníky / enumy které jsme viděli
### Typ slevy (`sales[].type`)
- `"premium"` — Xtra members discount
- (`"sale"` — klasická akce, ne vlastní pozorování ale dle označení)
### `badges[].type` (base)
- `"freshly-harvested"`, `"bio"`, `"low-price"`, ...
### `maxBasketAmountReason`
- `"ALLOWED"` — normálně lze koupit
- `"AVAILABLE"` — vidíme když `inStock=false` (out of stock)
### `flag` (base) — kód země původu
- `"cz"`, `"it"`, `"de"`, ...
### `unit` (base)
- `"kg"`, `"l"`, `"ks"`, `"g"`, `"ml"`, ...
### `categoryType` (listing)
- `"normal"` — běžné kategorie
- (existují i `"premium"`, `"recipes"` aj., nepoužíváme)
---
## 7. Postup scrapingu (high level)
```
ensure_logged_in()
└─ načte auth_state.json NEBO se přihlásí přes API a uloží state
get_category_tree()
└─ rekurzivně přes /navigation-tabs/categories + /subcategories
└─ vrátí strom uzlů {id, name, url, children}
for each leaf in tree (without children):
page = 0
while True:
ids = GET /api/v1/categories/normal/{leaf.id}/products?page={page}&size=50
if not ids: break
all_ids += ids
if len(ids) < 50: break
page += 1
for chunk in chunks(all_ids, 30):
base = GET /api/v1/products?products=...
prices = GET /api/v1/products/prices?products=...
stock = GET /api/v1/products/stock?products=...
categories = GET /api/v1/products/categories?products=...
merged = merge by productId
upsert to MongoDB
```
---
## 8. Důležité poznámky / gotchas
- **Cloudflare**: vždy nejdřív otevřít homepage v Playwright contextu, pak teprve API.
- **Cookie consent**: pro pokud možno nenápadné chování přijmout cookies přes `UC_UI.acceptAllConsents()`. Uložený state ho už neukazuje.
- **Headers**: zatím nepotřebujeme posílat speciální `User-Agent` ani `X-...` — Playwright context cookies stačí.
- **Rate**: zatím netestováno. Stránka sama posílá 5 paralelních requestů per chunk + listing. Ne víc.
- **Velikost chunků**: 30 ID per batch nám prošlo bez problémů. URL délka by zvládla i víc, ale držme se toho, co reálně chrome dělá.
- **Identita produktu**: `id` v base / `productId` v ostatních endpointech — totéž. Není garantována stálost ID napříč warehouses (ale `warehouseId=8799` je nás stabilní zóna).
- **Sklad-specifická data**: cena, dostupnost i `warehouseId` se odvíjí od `zoneId` v session. Pokud měníme adresu, měníme i ceny → držet jednu doručovací adresu pro reprodukovatelnost.
- **Kategorie ne-listy**: hlavní kategorie zobrazují jen "Doporučujeme" (cca 5 produktů). Pro úplný katalog scrapovat **jen listy** stromu (uzly bez `children`).
- **Archived products**: `archived: true` znamená, že produkt už není v nabídce — uložit historicky, ale nemarkovat jako aktivní.
---
## 9. Soubory v projektu
| Soubor | Co dělá |
|--------|---------|
| `config.py` | Cesty + creds z `.env` |
| `test_login.py` | `ensure_logged_in()` — session reuse + API login + accept cookies |
| `scrape_categories.py` | Stáhne živý strom kategorií → `categories_live.json` |
| `scrape_first_leaf.py` | Demo: stáhne první leaf a vypíše produkty |
| `auth_state.json` | Cookies + localStorage (gitignored) |
| `categories_live.json` | Aktuální strom kategorií |
| `products_<id>.json` | Demo dump produktů z jedné kategorie |
| `scraper.py` | (zastaralý) původní DOM scraping přes Playwright |
| `categories.py` | (zastaralý) hardcoded strom kategorií |
| `db.py` | MongoDB ops — bude potřeba upravit pro nový tvar dat |
+93
View File
@@ -0,0 +1,93 @@
"""
Test DB layer: load products_300102013.json (already scraped data)
and upsert into MongoDB 'rohlik' database.
No scraping needed — just validates the db.py functions work
with real API response shapes.
"""
import json
import sys
import io
from pathlib import Path
from db import get_db, ensure_indexes, upsert_products, upsert_category
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")
DATA_FILE = Path(__file__).parent / "products_300102013.json"
def main():
db = get_db()
print(f"Connected to: {db.client.address} / {db.name}")
ensure_indexes(db)
print("Indexes created.\n")
# --- test category upsert ---
upsert_category(db, {
"_id": 300102013,
"name": "Okurky, cukety a lilky",
"slug": "okurky-cukety-a-lilky",
"path": [300102000, 300102008, 300102013],
"pathNames": ["Ovoce a zelenina", "Zelenina", "Okurky, cukety a lilky"],
"parentId": 300102008,
"isLeaf": True,
})
print("Category 300102013 upserted.")
# --- load scraped products ---
products = json.loads(DATA_FILE.read_text(encoding="utf-8"))
print(f"Loaded {len(products)} products from {DATA_FILE.name}\n")
# split merged records back into the 4 lists that upsert_products expects
bases = []
prices_list = []
stocks = []
categories_list = []
for p in products:
base = p.get("base", {})
prices = p.get("prices", {})
stock = p.get("stock", {})
cats = p.get("categories", {})
bases.append(base)
prices_list.append(prices)
stocks.append(stock)
categories_list.append(cats)
upsert_products(db, bases, prices_list, stocks, categories_list)
print(f"Upserted {len(bases)} products.\n")
# --- verify ---
n_products = db.products.count_documents({})
n_history = db.price_history.count_documents({})
n_cats = db.categories.count_documents({})
print(f"DB counts:")
print(f" products: {n_products}")
print(f" price_history: {n_history}")
print(f" categories: {n_cats}")
# show one sample
sample = db.products.find_one({"_id": 1407650})
if sample:
print(f"\nSample product: {sample['name']}")
print(f" price: {sample['currentPrice']} {sample['currency']}")
print(f" per unit: {sample['currentPricePerUnit']}/{sample.get('unit', '?')}")
print(f" inStock: {sample['inStock']}")
print(f" sale: {sample['sale']}")
print(f" badges: {[b['title'] for b in sample.get('badges', [])]}")
# show price_history entry
hist = db.price_history.find_one({"productId": 1407650})
if hist:
print(f"\n price_history record: price={hist['price']}, "
f"inStock={hist['inStock']}, scrapedAt={hist['scrapedAt']}")
print("\nDone.")
if __name__ == "__main__":
main()
+94 -57
View File
@@ -1,15 +1,6 @@
"""
Rohlik.cz Price Scraper - Database Operations
Version: 1.0.0
Date: 2026-05-31
MongoDB operations for the Rohlik.cz price scraper.
Collections: products, price_history, categories, scrape_runs.
MongoDB server: 192.168.1.76 (no authentication).
"""
from datetime import datetime, timezone from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING
from pymongo import MongoClient, ASCENDING, DESCENDING, TEXT
from config import MONGO_URI, MONGO_DB from config import MONGO_URI, MONGO_DB
@@ -19,70 +10,116 @@ def get_db():
def ensure_indexes(db): def ensure_indexes(db):
db.products.create_index([("product_id", ASCENDING)], unique=True) db.categories.create_index("parentId")
db.products.create_index([("category_id", ASCENDING)]) db.categories.create_index("isLeaf")
db.products.create_index([("name", ASCENDING)])
db.price_history.create_index([("product_id", ASCENDING), ("scraped_at", ASCENDING)]) db.products.create_index("mainCategoryId")
db.price_history.create_index([("scraped_at", ASCENDING)]) db.products.create_index([("archived", ASCENDING), ("lastSeen", DESCENDING)])
db.products.create_index([("name", TEXT)])
db.categories.create_index([("category_id", ASCENDING)], unique=True) db.price_history.create_index([("productId", ASCENDING), ("scrapedAt", DESCENDING)])
db.price_history.create_index([("scrapedAt", DESCENDING)])
db.scrape_runs.create_index([("started_at", ASCENDING)]) db.scrape_runs.create_index([("startedAt", DESCENDING)])
def upsert_product(db, product: dict): def upsert_category(db, cat: dict):
db.categories.update_one(
{"_id": cat["_id"]},
{"$set": cat},
upsert=True,
)
def upsert_categories(db, cats: list[dict]):
for cat in cats:
upsert_category(db, cat)
def upsert_product(db, base: dict, prices: dict, stock: dict, categories: list[dict]):
now = datetime.now(timezone.utc) now = datetime.now(timezone.utc)
product_id = product["product_id"] product_id = base["id"]
sale_raw = prices.get("sales", [])
sale = None
if sale_raw:
s = sale_raw[0]
sale = {
"type": s.get("type"),
"price": s["price"]["amount"],
"pricePerUnit": s.get("pricePerUnit", {}).get("amount"),
"badge": (s.get("badges") or [{}])[0].get("title"),
"validTill": s.get("validTill"),
}
category_path = [c["id"] for c in categories] if categories else []
doc = {
"name": base["name"],
"slug": base.get("slug"),
"brand": base.get("brand"),
"unit": base.get("unit"),
"textualAmount": base.get("textualAmount"),
"weightedItem": base.get("weightedItem", False),
"mainCategoryId": base.get("mainCategoryId"),
"categoryPath": category_path,
"allCategories": [
{"id": c["id"], "name": c["name"], "level": c.get("level", 0)}
for c in categories
] if categories else [],
"countryCode": base.get("flag"),
"images": base.get("images", []),
"badges": base.get("badges", []),
"archived": base.get("archived", False),
"premiumOnly": base.get("premiumOnly", False),
"currentPrice": prices["price"]["amount"],
"currentPricePerUnit": prices.get("pricePerUnit", {}).get("amount"),
"currency": prices["price"].get("currency", "CZK"),
"sale": sale,
"inStock": stock.get("inStock", False),
"maxBasketAmount": stock.get("maxBasketAmount", 0),
"packageAmount": stock.get("packageInfo", {}).get("amount"),
"packageUnit": stock.get("packageInfo", {}).get("unit"),
"warehouseId": stock.get("warehouseId"),
"lastSeen": now,
"lastScrapedAt": now,
}
db.products.update_one( db.products.update_one(
{"product_id": product_id}, {"_id": product_id},
{ {
"$set": { "$set": doc,
"name": product["name"], "$setOnInsert": {"firstSeen": now},
"category_id": product.get("category_id"),
"category_name": product.get("category_name"),
"amount": product.get("amount"),
"unit_price": product.get("unit_price"),
"image_url": product.get("image_url"),
"product_url": product.get("product_url"),
"category_path": product.get("category_path"),
"updated_at": now,
},
"$setOnInsert": {
"created_at": now,
},
}, },
upsert=True, upsert=True,
) )
db.price_history.insert_one({ db.price_history.insert_one({
"product_id": product_id, "productId": product_id,
"price": product["price"], "scrapedAt": now,
"original_price": product.get("original_price"), "price": prices["price"]["amount"],
"discount_badge": product.get("discount_badge"), "pricePerUnit": prices.get("pricePerUnit", {}).get("amount"),
"unit_price": product.get("unit_price"), "inStock": stock.get("inStock", False),
"scraped_at": now, "sale": sale,
}) })
def upsert_category(db, category: dict): def upsert_products(db, bases: list, prices_list: list, stocks: list, categories_list: list):
now = datetime.now(timezone.utc) prices_map = {p["productId"]: p for p in prices_list}
db.categories.update_one( stock_map = {s["productId"]: s for s in stocks}
{"category_id": category["category_id"]}, cats_map = {c["productId"]: c.get("categories", []) for c in categories_list}
{
"$set": { for base in bases:
"name": category["name"], pid = base["id"]
"url": category["url"], upsert_product(
"parent_id": category.get("parent_id"), db,
"has_children": category.get("has_children", False), base,
"updated_at": now, prices_map.get(pid, {"price": {"amount": 0}}),
}, stock_map.get(pid, {}),
"$setOnInsert": {"created_at": now}, cats_map.get(pid, []),
}, )
upsert=True,
)
def log_scrape_run(db, run_data: dict): def log_scrape_run(db, run_data: dict):
run_data.setdefault("startedAt", datetime.now(timezone.utc))
db.scrape_runs.insert_one(run_data) db.scrape_runs.insert_one(run_data)
+355 -202
View File
@@ -1,254 +1,407 @@
""" """
Rohlik.cz Price Scraper - Main Scraper Rohlik.cz Price Scraper — API-based
Version: 1.0.0 Iterates leaf categories, fetches product IDs via listing API,
Date: 2026-05-31 pulls details from 4 batch endpoints, upserts into MongoDB.
Playwright-based scraper that iterates all leaf categories on Rohlik.cz,
scrolls to lazy-load every product card, and extracts pricing data from the DOM.
Supports authenticated scraping (prices differ for logged-in users).
Usage: Usage:
python scraper.py --no-db --visible # scrape to JSON, visible browser python scraper.py # all categories -> MongoDB
python scraper.py --no-db --filter "Brambory" # scrape single category to JSON python scraper.py --category "Ovoce a zelenina" # one main category only
python scraper.py # scrape to MongoDB python scraper.py --no-db # dry run, no DB writes
python scraper.py --visible # scrape to MongoDB, visible browser python scraper.py --visible # show browser window
""" """
import re import sys
import json import io
import argparse
import logging import logging
from datetime import datetime, timezone from datetime import datetime, timezone
from pathlib import Path
from playwright.sync_api import sync_playwright, Page from playwright.sync_api import sync_playwright
from config import ( from config import BASE_URL
BASE_URL, AUTH_STATE_PATH, from test_login import ensure_logged_in
ROHLIK_EMAIL, ROHLIK_PASSWORD, from db import get_db, ensure_indexes, upsert_products, upsert_categories, log_scrape_run
SCROLL_PAUSE, MAX_SCROLLS,
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding="utf-8", errors="replace")
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding="utf-8", errors="replace")
logging.basicConfig(
level=logging.INFO,
format="%(asctime)s %(message)s",
datefmt="%H:%M:%S",
) )
from categories import get_leaf_categories, get_all_categories_flat
from db import get_db, ensure_indexes, upsert_product, upsert_category, log_scrape_run
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger(__name__) log = logging.getLogger(__name__)
PAGE_SIZE = 50
CHUNK = 30
def parse_price(raw: str | None) -> float | None: MAIN_CATS_URL = f"{BASE_URL}/api/v5/navigation/components/navigation-tabs/categories"
if not raw: SUBCATS_URL = f"{BASE_URL}/api/v4/navigation/components/navigation-tabs/subcategories"
return None
digits = re.sub(r"[^\d]", "", raw) BATCH_ENDPOINTS = {
if not digits: "base": "/api/v1/products",
return None "prices": "/api/v1/products/prices",
return int(digits) / 100 "stock": "/api/v1/products/stock",
"categories": "/api/v1/products/categories",
}
def parse_original_price(raw: str | None) -> float | None: # ---------------------------------------------------------------------------
if not raw: # helpers
return None # ---------------------------------------------------------------------------
match = re.search(r"([\d\s]+[,.][\d]+)", raw.replace("\xa0", " "))
if match: def get_json(context, url, **params):
return float(match.group(1).replace(" ", "").replace(",", ".")) resp = context.request.get(url, params=params or None)
digits = re.sub(r"[^\d]", "", raw) if resp.status != 200:
if digits: raise RuntimeError(f"HTTP {resp.status}: {url[:120]}")
return float(digits) / 100 return resp.json()
def as_list(payload):
if isinstance(payload, list):
return payload
if isinstance(payload, dict):
for k in ("data", "products", "items"):
v = payload.get(k)
if isinstance(v, list):
return v
return []
def pick(d, *keys):
"""Return the first non-None value among the given keys."""
for k in keys:
if isinstance(d, dict) and d.get(k) is not None:
return d[k]
return None return None
def login(page: Page): # ---------------------------------------------------------------------------
log.info("Logging in to Rohlik.cz...") # category tree — live from API
page.goto(BASE_URL, wait_until="domcontentloaded", timeout=60000) # ---------------------------------------------------------------------------
page.wait_for_timeout(3000)
page.locator('text="Přihlásit se"').first.click() def normalize_main(payload):
page.wait_for_timeout(2000) if isinstance(payload, list):
return payload
page.locator('input[type="email"], input[name="email"]').first.fill(ROHLIK_EMAIL) for key in ("data", "categories", "items", "navigationTabs", "tabs"):
page.locator('input[type="password"], input[name="password"]').first.fill(ROHLIK_PASSWORD) v = payload.get(key)
page.locator('button[type="submit"]').first.click() if isinstance(v, list):
page.wait_for_timeout(5000) return v
if isinstance(v, dict):
page.context.storage_state(path=AUTH_STATE_PATH) for k2 in ("categories", "items", "tabs"):
log.info("Login successful, auth state saved.") if isinstance(v.get(k2), list):
return v[k2]
return []
def scroll_to_load_all(page: Page) -> int: def subs_from_payload(payload):
prev_count = 0 if isinstance(payload, list):
for i in range(MAX_SCROLLS): return payload
page.evaluate("window.scrollTo(0, document.body.scrollHeight)") if isinstance(payload, dict):
page.wait_for_timeout(int(SCROLL_PAUSE * 1000)) for k in ("data", "subcategories", "items", "categories"):
current_count = page.locator('[data-test^="productCard-AVAILABLE-"]').count() v = payload.get(k)
if current_count == prev_count and i > 2: if isinstance(v, list):
break return v
prev_count = current_count return []
return prev_count
def extract_products(page: Page, category: dict) -> list[dict]: def fetch_children_recursive(context, parent_id, visited, depth=1, max_depth=6):
products_data = page.evaluate(""" if str(parent_id) in visited or depth > max_depth:
() => {
const products = [];
document.querySelectorAll('[data-test^="productCard-AVAILABLE-"]').forEach(card => {
const id = card.getAttribute('data-test').replace('productCard-AVAILABLE-', '');
const nameEl = card.querySelector('[data-test="productCard-body-name"]');
const priceNoEl = card.querySelector('[data-test="productCard-body-price-priceNo"]');
const saleEl = card.querySelector('[data-test="productCard-body-price-sale"]');
const amountEl = card.querySelector('[data-test="productCard-footer-amount"]');
const unitPriceEl = card.querySelector('[data-test="productCard-footer-unitPrice"]');
const badgeEl = card.querySelector('[data-test="productCard-body-badge"]');
const imgEl = card.querySelector('img');
const linkEl = card.querySelector('a[href*="/"]');
products.push({
product_id: id,
name: nameEl?.textContent?.trim() || '',
price_raw: priceNoEl?.textContent?.trim() || '',
original_price_raw: saleEl?.textContent?.trim() || '',
amount: amountEl?.textContent?.trim() || '',
unit_price_raw: unitPriceEl?.textContent?.trim() || '',
discount_badge: badgeEl?.textContent?.trim() || '',
image_url: imgEl?.src || '',
product_url: linkEl?.getAttribute('href') || '',
});
});
return products;
}
""")
results = []
for p in products_data:
results.append({
"product_id": p["product_id"],
"name": p["name"],
"price": parse_price(p["price_raw"]),
"original_price": parse_original_price(p["original_price_raw"]),
"discount_badge": p["discount_badge"] or None,
"amount": p["amount"] or None,
"unit_price": p["unit_price_raw"].strip() or None,
"image_url": p["image_url"] or None,
"product_url": f"{BASE_URL}{p['product_url']}" if p["product_url"] else None,
"category_id": category["id"],
"category_name": category["name"],
"category_path": " > ".join(category.get("path", [category["name"]])),
})
return results
def scrape_leaf(page: Page, category: dict) -> list[dict]:
url = f"{BASE_URL}{category['url']}"
log.info("Scraping: %s (%s)", " > ".join(category.get("path", [category["name"]])), url)
page.goto(url, wait_until="domcontentloaded", timeout=60000)
page.wait_for_timeout(3000)
try:
page.wait_for_selector('[data-test^="productCard-AVAILABLE-"]', timeout=15000)
except Exception:
log.warning(" No products found in %s, skipping.", category["name"])
return [] return []
visited.add(str(parent_id))
total = scroll_to_load_all(page) sub_payload = get_json(context, SUBCATS_URL, categoryIds=str(parent_id))
products = extract_products(page, category) subs = subs_from_payload(sub_payload)
log.info(" %d products extracted (loaded %d)", len(products), total)
return products out = []
for s in subs:
if not isinstance(s, dict):
continue
sid = pick(s, "id", "categoryId")
node = {
"id": sid,
"name": pick(s, "name", "title", "label"),
"url": pick(s, "url", "link", "slug"),
"children": [],
}
if sid and s.get("subcategoryIds"):
node["children"] = fetch_children_recursive(context, sid, visited, depth + 1, max_depth)
out.append(node)
return out
def run_scraper( def fetch_category_tree(context):
category_filter: str | None = None, """Fetch full category tree live from Rohlik API."""
headless: bool = True, log.info("Fetching main categories ...")
save_to_db: bool = True, main_payload = get_json(context, MAIN_CATS_URL)
): main_cats = normalize_main(main_payload)
leaves = get_leaf_categories() log.info(" %d main categories", len(main_cats))
if category_filter:
category_filter_lower = category_filter.lower()
leaves = [c for c in leaves if category_filter_lower in " > ".join(c["path"]).lower()]
log.info("Will scrape %d leaf categories", len(leaves)) tree = []
visited = set()
log.info("Fetching subcategories recursively ...")
for cat in main_cats:
cid = pick(cat, "id", "categoryId")
cname = pick(cat, "name", "title", "label")
curl = pick(cat, "url", "link", "slug")
if not cid:
continue
children = fetch_children_recursive(context, cid, visited)
node = {"id": cid, "name": cname, "url": curl, "children": children}
tree.append(node)
n_desc = count_nodes(children)
log.info(" - %s -> %d subcategories", cname, n_desc)
total = count_nodes(tree)
log.info(" Total: %d categories (incl. main)", total)
return tree
def count_nodes(nodes):
total = len(nodes)
for n in nodes:
total += count_nodes(n.get("children", []))
return total
def collect_leaves(nodes, path=None):
"""Return flat list of leaf nodes with their full path."""
if path is None:
path = []
leaves = []
for n in nodes:
current = path + [n["name"]]
children = n.get("children") or []
if children:
leaves.extend(collect_leaves(children, current))
else:
leaves.append({**n, "path": current})
return leaves
def tree_to_db_docs(nodes, parent_id=None, path=None, path_names=None):
"""Convert tree nodes to flat category docs for MongoDB."""
if path is None:
path = []
if path_names is None:
path_names = []
docs = []
for n in nodes:
cur_path = path + [n["id"]]
cur_names = path_names + [n["name"]]
children = n.get("children") or []
docs.append({
"_id": n["id"],
"name": n["name"],
"slug": (n.get("url") or "").lstrip("/"),
"path": cur_path,
"pathNames": cur_names,
"parentId": parent_id,
"isLeaf": len(children) == 0,
})
if children:
docs.extend(tree_to_db_docs(children, n["id"], cur_path, cur_names))
return docs
# ---------------------------------------------------------------------------
# product fetching
# ---------------------------------------------------------------------------
def fetch_product_ids(context, category_id):
"""Paginate through listing API, return all product IDs for a leaf."""
all_ids = []
page = 0
while True:
url = (f"{BASE_URL}/api/v1/categories/normal/{category_id}/products"
f"?page={page}&size={PAGE_SIZE}&sort=recommended&filter=&excludeProductIds=")
data = get_json(context, url)
ids = data.get("productIds") or []
all_ids.extend(ids)
if len(ids) < PAGE_SIZE:
break
page += 1
return all_ids
def fetch_batch(context, endpoint, product_ids):
qs = "&".join(f"products={pid}" for pid in product_ids)
url = f"{BASE_URL}{endpoint}?{qs}"
return as_list(get_json(context, url))
def fetch_product_details(context, product_ids):
"""For a chunk of IDs, call 4 batch endpoints and return raw lists."""
bases = fetch_batch(context, BATCH_ENDPOINTS["base"], product_ids)
prices = fetch_batch(context, BATCH_ENDPOINTS["prices"], product_ids)
stocks = fetch_batch(context, BATCH_ENDPOINTS["stock"], product_ids)
cats = fetch_batch(context, BATCH_ENDPOINTS["categories"], product_ids)
return bases, prices, stocks, cats
# ---------------------------------------------------------------------------
# console output
# ---------------------------------------------------------------------------
def print_header():
log.info("=" * 100)
log.info(" ROHLIK.CZ PRICE SCRAPER")
log.info("=" * 100)
def print_category_header(leaf, leaf_idx, total_leaves):
path_str = " > ".join(leaf["path"])
log.info("")
log.info("-" * 100)
log.info(" [%d/%d] %s (id=%s)", leaf_idx, total_leaves, path_str, leaf["id"])
log.info("-" * 100)
def print_products_table(bases, prices_list, stocks):
"""Print a compact table of products in this chunk."""
prices_map = {p["productId"]: p for p in prices_list}
stock_map = {s["productId"]: s for s in stocks}
for b in bases:
pid = b["id"]
p = prices_map.get(pid, {})
s = stock_map.get(pid, {})
name = b.get("name", "?")[:50]
price = p.get("price", {}).get("amount")
ppu = p.get("pricePerUnit", {}).get("amount")
unit = b.get("unit", "")
in_stock = s.get("inStock")
stock_str = "+" if in_stock else "-" if in_stock is False else "?"
sale_str = ""
sales = p.get("sales") or []
if sales:
sp = sales[0].get("price", {}).get("amount")
badge = (sales[0].get("badges") or [{}])[0].get("title", "")
if sp:
sale_str = f"{sp:.2f} {badge}"
price_str = f"{price:.2f}" if isinstance(price, (int, float)) else "?"
ppu_str = f"{ppu:.2f}/{unit}" if isinstance(ppu, (int, float)) else ""
log.info(" %s %9d %8s %12s %14s %s",
stock_str, pid, price_str, ppu_str, sale_str, name)
def print_summary(stats):
log.info("")
log.info("=" * 100)
log.info(" DONE")
log.info(" Categories: %d", stats["categories_scraped"])
log.info(" Products: %d unique", stats["products_total"])
log.info(" Duration: %.1f s", stats["duration_seconds"])
if stats.get("errors"):
log.info(" Errors: %d", stats["errors"])
log.info("=" * 100)
# ---------------------------------------------------------------------------
# main
# ---------------------------------------------------------------------------
def run_scraper(category_filter=None, headless=True, save_to_db=True):
db = None
if save_to_db:
db = get_db()
ensure_indexes(db)
with sync_playwright() as pw: with sync_playwright() as pw:
ctx_args = {} context, page = ensure_logged_in(pw, headless=headless)
if Path(AUTH_STATE_PATH).exists():
ctx_args["storage_state"] = AUTH_STATE_PATH
browser = pw.chromium.launch(headless=headless) # fetch live category tree from API
context = browser.new_context(**ctx_args) tree = fetch_category_tree(context)
page = context.new_page()
page.goto(BASE_URL, wait_until="domcontentloaded", timeout=60000) # filter to one main category if requested
page.wait_for_timeout(5000) if category_filter:
cf = category_filter.lower()
tree = [t for t in tree if cf in t["name"].lower()]
if not tree:
raise SystemExit(f"No main category matching '{category_filter}'")
is_logged_in = page.locator('text="Přihlásit se"').count() == 0 leaves = collect_leaves(tree)
if not is_logged_in: log.info("Scraping %d leaf categories", len(leaves))
if ROHLIK_EMAIL and ROHLIK_PASSWORD:
login(page) # save categories to MongoDB
context = browser.new_context(storage_state=AUTH_STATE_PATH) if db is not None:
page = context.new_page() cat_docs = tree_to_db_docs(tree)
else: upsert_categories(db, cat_docs)
log.warning("Not logged in! Prices may differ from member prices.") log.info("Upserted %d category docs", len(cat_docs))
print_header()
run_start = datetime.now(timezone.utc) run_start = datetime.now(timezone.utc)
all_products = []
seen_ids = set() seen_ids = set()
total_products = 0
errors = 0
db = None for i, leaf in enumerate(leaves, 1):
if save_to_db: print_category_header(leaf, i, len(leaves))
db = get_db()
ensure_indexes(db)
for cat_data in get_all_categories_flat():
upsert_category(db, cat_data)
for leaf in leaves:
try: try:
products = scrape_leaf(page, leaf) product_ids = fetch_product_ids(context, leaf["id"])
for p in products: log.info(" %d product IDs", len(product_ids))
if p["product_id"] not in seen_ids:
seen_ids.add(p["product_id"]) if not product_ids:
all_products.append(p) continue
if db:
upsert_product(db, p) # deduplicate within run
new_ids = [pid for pid in product_ids if pid not in seen_ids]
seen_ids.update(product_ids)
# process in chunks
for j in range(0, len(new_ids), CHUNK):
chunk = new_ids[j:j + CHUNK]
bases, prices, stocks, cats = fetch_product_details(context, chunk)
print_products_table(bases, prices, stocks)
if db is not None:
upsert_products(db, bases, prices, stocks, cats)
total_products += len(bases)
except Exception: except Exception:
log.exception("Error scraping %s", leaf["name"]) log.exception(" ERROR in %s", leaf["name"])
errors += 1
context.browser.close()
run_end = datetime.now(timezone.utc) run_end = datetime.now(timezone.utc)
run_data = { stats = {
"started_at": run_start, "startedAt": run_start,
"finished_at": run_end, "finishedAt": run_end,
"duration_seconds": (run_end - run_start).total_seconds(), "duration_seconds": (run_end - run_start).total_seconds(),
"categories_scraped": len(leaves), "categories_scraped": len(leaves),
"products_scraped": len(all_products), "products_total": total_products,
"errors": errors,
"filter": category_filter,
} }
if db: if db is not None:
log_scrape_run(db, run_data) log_scrape_run(db, stats)
log.info( print_summary(stats)
"Done: %d unique products from %d categories in %.1fs", return stats
len(all_products), len(leaves), run_data["duration_seconds"],
)
browser.close()
return all_products
def scrape_to_json(output_path: str = "products.json", **kwargs):
products = run_scraper(save_to_db=False, **kwargs)
with open(output_path, "w", encoding="utf-8") as f:
json.dump(products, f, ensure_ascii=False, indent=2, default=str)
log.info("Saved %d products to %s", len(products), output_path)
return products
if __name__ == "__main__": if __name__ == "__main__":
import argparse parser = argparse.ArgumentParser(description="Rohlik.cz price scraper (API)")
parser.add_argument("--category", type=str, help="Scrape only this main category (e.g. 'Ovoce a zelenina')")
parser = argparse.ArgumentParser(description="Rohlik.cz price scraper") parser.add_argument("--no-db", action="store_true", help="Dry run — no MongoDB writes")
parser.add_argument("--no-db", action="store_true", help="Save to JSON instead of MongoDB") parser.add_argument("--visible", action="store_true", help="Show browser window")
parser.add_argument("--visible", action="store_true", help="Run browser in visible mode")
parser.add_argument("--filter", type=str, help="Filter categories by name (e.g. 'Ovoce', 'Zelenina > Rajčata')")
args = parser.parse_args() args = parser.parse_args()
if args.no_db: run_scraper(
scrape_to_json(category_filter=args.filter, headless=not args.visible) category_filter=args.category,
else: headless=not args.visible,
run_scraper(category_filter=args.filter, headless=not args.visible) save_to_db=not args.no_db,
)
+2 -2
View File
@@ -50,11 +50,11 @@ def api_login(context: BrowserContext) -> int:
return resp.status return resp.status
def ensure_logged_in(pw) -> tuple[BrowserContext, Page]: def ensure_logged_in(pw, headless=False) -> tuple[BrowserContext, Page]:
auth_path = Path(AUTH_STATE_PATH) auth_path = Path(AUTH_STATE_PATH)
have_state = auth_path.exists() have_state = auth_path.exists()
browser = pw.chromium.launch(headless=False, args=["--start-maximized"]) browser = pw.chromium.launch(headless=headless, args=["--start-maximized"])
ctx_args = {"no_viewport": True} ctx_args = {"no_viewport": True}
if have_state: if have_state:
ctx_args["storage_state"] = AUTH_STATE_PATH ctx_args["storage_state"] = AUTH_STATE_PATH