This commit is contained in:
2026-06-17 15:05:10 +02:00
parent de959d849d
commit 4884117227
85 changed files with 34611 additions and 0 deletions
+70
@@ -0,0 +1,70 @@
# sipiq_import_v1.0: import of SIPIQ responses into MongoDB
**Version:** 1.0 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)
## Purpose
Import SIPIQ responses (Qualtrics CSV export) into the MongoDB database `feasibility` so that it is possible to:
1. **cross-analyze** "question × question" (flat `answers{}` keyed by Qcode),
2. **reconstruct the complete SIPIQ** as it appears in the blank PDF, only filled in (a question dictionary
   with sections, order, sub-part labels, type, and options).
## Input
Qualtrics **CSV** export (Download a data table → CSV, *Download all fields*, *Export labels*,
decimal **point**, i.e. "Use commas for decimals" left UNchecked). The CSV has 3 header rows:
- row 1 = Qcode (Q2, Q6_4, Q31#1_1 …)
- row 2 = **question text** (legend)
- row 3 = `{"ImportId":"QID…"}` = the QID code, identical to the one in the XML export (the XML↔CSV bridge)

The XML export does NOT contain the question text (only QID tags), which is why the import reads the CSV.
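
The three header rows can be read roughly as follows. This is a minimal sketch: the function name `parse_qualtrics_csv` and the sample rows are made up for illustration, and only the three-row layout itself comes from the description above.

```python
import csv
import io
import json


def parse_qualtrics_csv(text):
    """Split a Qualtrics export into column descriptors and data rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 3:
        raise ValueError("expected 3 header rows (Qcode, question text, ImportId)")
    codes, texts, import_cells = rows[0], rows[1], rows[2]
    cols = []
    for i, (code, qtext, cell) in enumerate(zip(codes, texts, import_cells)):
        try:
            qid = json.loads(cell).get("ImportId", "")
        except (ValueError, AttributeError):
            qid = cell  # cell is not a JSON object: keep it verbatim
        cols.append({"i": i, "code": code, "text": qtext, "qid": qid})
    return cols, rows[3:]


# Hypothetical miniature export: 3 header rows followed by 1 response.
SAMPLE = (
    'ResponseId,Q2,Q6_4\n'
    'Response ID,Internal assessment,Contact - Phone\n'
    '"{""ImportId"":""_recordId""}","{""ImportId"":""QID2""}","{""ImportId"":""QID6_4""}"\n'
    'R_abc,Yes,123\n'
)
cols, data = parse_qualtrics_csv(SAMPLE)
```

Each column descriptor carries the Qcode, the legend text, and the QID, so any column can later be traced back to the XML export.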
## Two collections in `feasibility`
### `sipiq_questions`: questionnaire dictionary (1 document = 1 logical question)
`{_id=base Qcode (Q63), order, qnum, section, qids[QID…], text, type, items[{key,qcode,qid,label}], options[]}`
- `type`: `single_or_text` | `yesno` | `numeric` | `matrix_yesno` | `matrix_percent` | `matrix`
- `items[]` = sub-parts (matrix rows, percentage parts, contact fields) in order; `key` = sanitized Qcode (`#`/`.` → `_`)
- `options[]` = derived from the observed values (yes/no and single-choice)
- Idempotent `replace_one(upsert)`. Status as of 17JUN2026: **56 questions** (27 multi-part).
- **STEM_OVERRIDE**: for the matrix questions (Q31/Q63/Q64/Q69) Qualtrics truncates the text in the CSV header with "…",
  so the full wording is filled in from the blank SIPIQ PDF.
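
How Qcodes collapse into one dictionary document per logical question can be sketched like this. `sanitize_key` and `qbase` mirror the helpers in the script; `group_items` and the list of Qcodes are illustrative only.

```python
import re


def sanitize_key(qcode):
    """Qcode -> MongoDB-safe key: '#' and '.' become '_'."""
    return qcode.replace("#", "_").replace(".", "_")


def qbase(qcode):
    """Logical base of a question: Q63#1_2 -> Q63, Q40_6_TEXT -> Q40."""
    m = re.match(r"(Q\d+)", qcode)
    return m.group(1) if m else qcode


def group_items(qcodes):
    """Group sanitized item keys under their base Qcode, keeping CSV order."""
    groups = {}
    for code in qcodes:
        groups.setdefault(qbase(code), []).append(sanitize_key(code))
    return groups


groups = group_items(["Q25", "Q63#1_1", "Q63#1_2", "Q40_6_TEXT"])
```

The sanitized keys are the same ones used in `answers{}`, which is what lets the dictionary and the responses be joined later.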
### `sipiq_responses`: 1 document = 1 response
- `_id` = **Qualtrics ResponseId** (`R_…`, unique, stable)
- site/PI identity promoted to the top level (`site_*`, `pi_*`, `sdl_site_id`, `fire_*`, `mailinglist_id`,
  `recipient_*`) → queryable
- `meta{}` = dates, status, progress, finished, duration, language, channel, IP, geo, survey date/time
- `answers{}` = **flat map** Qcode→value (`answers.Q37_1`, `answers.Q63_1_1`); this is the core for cross-analysis
- `is_full_sipiq`, `interested` (Q25) for convenience
- **`investigator_oid`** = ObjectId reference to `feasibility.investigators` (+ `investigator_match` = how the match was made)
- delta bookkeeping: `content_sha256`, `source_file`, `first_imported_at`, `last_seen_at`,
  `last_updated_at`, `history[]`
## Delta import (rewrites ONLY changed data)
- new response → INSERT
- exists, unchanged (same `content_sha256`) → updates only `last_seen_at` (and `source_file`)
- exists, changed → `$set` (v1.0 rewrites the CSV-derived fields of that one response) + `$push` of the
  field-level changes to `history[]` as `{changed_at, source_file, changes:[{key,old,new}]}`
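
The three-way decision above can be sketched without a database. `content_hash` follows the same idea as the script (hash everything except bookkeeping and link fields); `classify` and the sample document are hypothetical names used only for this illustration.

```python
import hashlib
import json

# Fields that must not influence the hash: bookkeeping and the soft-link result.
VOLATILE = {
    "content_sha256", "first_imported_at", "last_seen_at", "last_updated_at",
    "history", "investigator_oid", "investigator_match", "source_file",
}


def content_hash(doc):
    """Stable SHA-256 over the content fields of a response document."""
    payload = {k: v for k, v in doc.items() if k not in VOLATILE}
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def classify(new_doc, stored_hashes):
    """Return 'insert', 'update' or 'unchanged' for one response."""
    old = stored_hashes.get(new_doc["_id"])
    if old is None:
        return "insert"
    return "unchanged" if old == content_hash(new_doc) else "update"


doc = {"_id": "R_1", "answers": {"Q25": "Yes"}, "source_file": "a.csv"}
stored = {"R_1": content_hash(doc)}
```

Because `source_file` is excluded from the hash, re-exporting the same responses under a new file name does not register as a change.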
## Soft-link to investigators (non-destructive)
1. `pi_email` == `email`/`email2` (lowercase), 2. `recipient_email`, 3. fallback: surname
(accents stripped) + country. Reports the match and the KROK value. **investigators is NOT modified.**
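
The matching cascade can be illustrated on plain dictionaries. This is a sketch under assumptions: the investigator records and the returned labels are invented for the example, while the field names (`email`, `email2`, `prijmeni`, `zeme`) follow the script.

```python
import re
import unicodedata


def norm_name(s):
    """Lowercase, strip accents, collapse whitespace: 'Dvořák' -> 'dvorak'."""
    nfkd = unicodedata.normalize("NFKD", s or "")
    plain = "".join(c for c in nfkd if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", plain.lower()).strip()


def soft_link(resp, investigators):
    """Return (investigator _id or None, how the match was made)."""
    by_email, by_name = {}, {}
    for inv in investigators:
        for field in ("email", "email2"):
            e = (inv.get(field) or "").lower().strip()
            if e:
                by_email.setdefault(e, inv)
        nm = norm_name(inv.get("prijmeni"))
        if nm:
            by_name.setdefault((nm, inv.get("zeme")), []).append(inv)

    # 1) PI e-mail, 2) recipient e-mail
    for field in ("pi_email", "recipient_email"):
        e = (resp.get(field) or "").lower().strip()
        if e and e in by_email:
            return by_email[e]["_id"], field

    # 3) fallback: accent-free surname + country, only if unambiguous
    nm = norm_name(resp.get("pi_last_name"))
    cand = by_name.get((nm, resp.get("site_country")), []) if nm else []
    if len(cand) == 1:
        return cand[0]["_id"], "surname"
    return None, ("ambiguous" if cand else "not_found")


# Hypothetical investigator records
INVESTIGATORS = [
    {"_id": 1, "prijmeni": "Dvořák", "zeme": "Czech Republic", "email": "A@x.cz"},
    {"_id": 2, "prijmeni": "Novák", "zeme": "Slovakia", "email": "n1@x.sk"},
    {"_id": 3, "prijmeni": "Novak", "zeme": "Slovakia", "email": "n2@x.sk"},
]
```

The surname fallback refuses to guess when several investigators share a surname in the same country, which keeps the link safe to apply automatically.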
## Usage
```
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.0.py --csv "<cesta.csv>" --dry-run
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.0.py --csv "<cesta.csv>" --apply
```
`--scope czsk` (default, CZ+SK only) | `--scope all` (all 276). Mongo 192.168.1.76:27017, no auth, pymongo.
## Status as of 17JUN2026 (production run completed)
- `sipiq_questions`: 56 · `sipiq_responses`: 15 (CZ 8 + SK 7)
- **soft-link 15/15 via e-mail, all 15 = KROK 7** (validation: the completed SIPIQs belong to our KROK-7 investigators)
- `investigator_oid` stored as ObjectId → ready for `$lookup`
## Queries (examples)
```js
// cross-analysis: who expects recruitment problems (Q33 = Yes), with their eligible count (Q37_1)
// projected so the ">X eligible" cut can be applied on the result (answers are stored as strings)
db.sipiq_responses.find({"answers.Q33":"Yes"}, {pi_last_name:1,"answers.Q37_1":1})
// join with the investigator record
db.sipiq_responses.aggregate([{$lookup:{from:"investigators",localField:"investigator_oid",
foreignField:"_id",as:"inv"}}])
// SIPIQ reconstruction: sort sipiq_questions by order, for each question/item take answers[key]
```
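
The reconstruction step from the last comment can be sketched in Python on in-memory documents. The function `reconstruct` and the two sample questions are hypothetical. The sketch also assumes that a question without `items` stores its answer under its base Qcode, which may not hold for every column (for example a lone `_TEXT` field).

```python
def reconstruct(questions, answers):
    """Render a filled-in questionnaire as text lines, in dictionary order."""
    lines = []
    for q in sorted(questions, key=lambda q: q["order"]):
        lines.append(f"[{q['section']}] {q['_id']}: {q['text']}")
        if q["items"]:
            # Multi-part question: one line per sub-part, looked up by its key.
            for item in q["items"]:
                label = item.get("label", item["qcode"])
                lines.append(f"    {label}: {answers.get(item['key'], '')}")
        else:
            # Plain question: assumed to be stored under the base Qcode.
            lines.append(f"    {answers.get(q['_id'], '')}")
    return lines


# Hypothetical miniature dictionary and one response's answers{}
QUESTIONS = [
    {"_id": "Q63", "order": 1, "section": "Site Experience and Staffing",
     "text": "Experience?", "items": [{"key": "Q63_1_1", "qcode": "Q63#1_1", "label": "Endoscopy"}]},
    {"_id": "Q25", "order": 0, "section": "Interest", "text": "Interested?", "items": []},
]
ANSWERS = {"Q25": "Yes", "Q63_1_1": "No"}
lines = reconstruct(QUESTIONS, ANSWERS)
```

With pymongo the two inputs would come from `sipiq_questions` and from the `answers` field of one `sipiq_responses` document.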
+534
@@ -0,0 +1,534 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
sipiq_import_v1.0.py
====================
Version: 1.0
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)

Description
-----------
Import of SIPIQ responses (Qualtrics CSV export, study 77242113UCO3002 / ICONIC DAWN)
into the MongoDB database `feasibility`. The goals are:
  (a) to enable "question × question" cross-analysis (flat answers keyed by Qcode),
  (b) to enable reconstruction of the COMPLETE SIPIQ as the investigator sees it in the PDF,
      only with the answers filled in (question dictionary with section/order/labels).

Two collections in db `feasibility`:
  * sipiq_questions   questionnaire dictionary (1 document = 1 logical question; section, order,
                      text, items[], type, options). Idempotent (upsert by _id).
  * sipiq_responses   1 document = 1 response (_id = Qualtrics ResponseId). Site/PI identity
                      at the top level, flat answers{}, meta{}, soft-link investigator_oid,
                      delta bookkeeping (content_sha256, history[], timestamps).

DELTA import (rewrites ONLY changed data):
  - new response          -> insert
  - exists, unchanged     -> updates only last_seen_at (+ source_file)
  - exists, changed       -> $set of the re-imported fields + push to history[] {key,old,new}

Soft-link to feasibility.investigators:
  - primarily pi_email == email / email2 (lowercase), then recipient_email
  - fallback: surname (accents stripped, lowercase) + country (CZ/SK)
  - non-destructive: does NOT modify the investigators collection, it only stores
    investigator_oid in the response.

Scope: default CZ + SK (--scope czsk). --scope all = all 276.

Usage:
python sipiq_import_v1.0.py --csv "<cesta.csv>" --dry-run
python sipiq_import_v1.0.py --csv "<cesta.csv>" --apply
Dependencies: pymongo (.venv). Mongo 192.168.1.76:27017, no auth.
"""
import argparse
import csv
import hashlib
import json
import re
import sys
import unicodedata
from datetime import datetime, timezone
try:
    from pymongo import MongoClient
except ImportError:
    print("ERROR: pymongo is not installed in the current Python.", file=sys.stderr)
    raise
MONGO_URI = "mongodb://192.168.1.76:27017"
DB_NAME = "feasibility"
COL_Q = "sipiq_questions"
COL_R = "sipiq_responses"
# Qualtrics system meta fields (these do NOT go into answers)
META_COLS = {
"StartDate", "EndDate", "Status", "IPAddress", "Progress", "Duration (in seconds)",
"Finished", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName",
"RecipientEmail", "ExternalReference", "LocationLatitude", "LocationLongitude",
"DistributionChannel", "UserLanguage",
}
# Embedded SDL fields promoted to the top level of the document (queryable identity)
PROMOTE = [
"site_name", "site_address", "site_city", "site_state", "site_postcode", "site_country",
"pi_first_name", "pi_last_name", "pi_phone", "pi_email",
"sdl_site_id", "fire_site_id", "fire_investigator_id", "mailinglist_id",
"survey_generated_by", "Date", "Time",
]
# Sections per the verified catalogue (maps the base Q-number -> section in the PDF)
SECTION_BY_QNUM = {}


def _sec(rng, name):
    for n in rng:
        SECTION_BY_QNUM[n] = name


_sec([2], "J&J Internal Assessment")
_sec([6, 7, 8, 9, 10, 11, 12, 13], "Contact Information")
_sec(range(14, 22), "Confidentiality Statement")
_sec([25, 26, 27], "Interest")
_sec([29, 30, 31, 32, 33, 34], "Protocol Requirements")
_sec([36, 37, 38], "Enrollment")
_sec([40, 41, 42, 43], "Patient Demographics Overview")
_sec([45, 46, 47, 48, 49], "Site Overview")
_sec([51], "Operational Considerations")
_sec([53, 54], "Comments")
_sec([57, 58, 59, 60, 61], "Patient Population")
_sec([63, 64, 65, 66, 67], "Site Experience and Staffing")
_sec([69], "Equipment and Facility Requirements")
_sec([71, 72, 73, 74, 75], "Institutional Review Board, Ethics Committee, and Contracts")
# Full wording of the questions that Qualtrics truncates with "..." in the CSV header (matrix questions).
# Source: the blank SIPIQ PDF (ICONIC ... _SipIQ_V1_13MAY2026.pdf).
STEM_OVERRIDE = {
"Q31": "At your site, at what line(s) of treatment do you most commonly prescribe "
"vedolizumab for patients with moderately to severely active ulcerative colitis?",
"Q63": "Do you or your site staff have experience in performing the following types of "
"study assessments/procedures?",
"Q64": "The following personnel are required to run the study. "
"Will your site have the following available?",
"Q69": "The following equipment and facilities are required to run the studies. "
"Are these available at your site?",
}
def now_iso():
    return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")


def strip_accents(s):
    if not s:
        return ""
    nfkd = unicodedata.normalize("NFKD", s)
    return "".join(c for c in nfkd if not unicodedata.combining(c))


def norm_name(s):
    return re.sub(r"\s+", " ", strip_accents(s or "").lower()).strip()


def sanitize_key(qcode):
    """Qcode -> key in answers{} (MongoDB-safe): '#' and '.' -> '_'."""
    return qcode.replace("#", "_").replace(".", "_")


def qnum(qcode):
    """Question number from a Qcode (Q63#1_2 -> 63, Q40_6_TEXT -> 40)."""
    m = re.match(r"Q(\d+)", qcode)
    return int(m.group(1)) if m else None


def qbase(qcode):
    """Logical base of a question (Q63#1_2 -> Q63, Q40_6 -> Q40, Q25 -> Q25)."""
    m = re.match(r"(Q\d+)", qcode)
    return m.group(1) if m else qcode


def import_id(h3_cell):
    """ImportId from a row-3 header cell; a cell that is not a JSON object is returned as-is."""
    try:
        return json.loads(h3_cell).get("ImportId", "")
    except (ValueError, AttributeError):
        return h3_cell


def split_text(text):
    """Return (stem, item_label). Stem = question text, item_label = label of the sub-part."""
    parts = [p.strip() for p in re.split(r"\s+-\s+", text)]
    stem = parts[0]
    if len(parts) == 1:
        return stem, None
    # everything after the stem labels the row/part; clean up Qualtrics artifacts
    label_parts = parts[1:]
    # drop "Selected Choice" (artifact of single-choice questions with Other)
    label_parts = [p for p in label_parts if p.lower() != "selected choice"]
    # drop internal statement codes such as "Q63#1"
    label_parts = [p for p in label_parts if not re.fullmatch(r"Q\d+#\d+", p)]
    label = " - ".join(label_parts) if label_parts else None
    return stem, label


def detect_type(qcode, observed):
    """Heuristic question type derived from the Qcode and the observed values."""
    has_hash = "#" in qcode
    vals = [v for v in observed if v]
    yesno = bool(vals) and all(v in ("Yes", "No") for v in vals)
    numeric = bool(vals) and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals)
    if has_hash and yesno:
        return "matrix_yesno"
    if has_hash and numeric:
        return "matrix_percent"
    if has_hash:
        return "matrix"
    if numeric:
        return "numeric"
    if yesno:
        return "yesno"
    return "single_or_text"


# ---------------------------------------------------------------------------
def load_csv(path):
    with open(path, encoding="utf-8-sig", newline="") as fh:
        rows = list(csv.reader(fh))
    # A Qualtrics export always starts with 3 header rows; fail clearly otherwise.
    if len(rows) < 3:
        raise ValueError(f"{path}: expected 3 Qualtrics header rows, found {len(rows)}")
    h1, h2, h3 = rows[0], rows[1], rows[2]
    data = rows[3:]
    cols = []
    for i, (code, text, imp) in enumerate(zip(h1, h2, h3)):
        cols.append({"i": i, "code": code, "text": text, "qid": import_id(imp)})
    return cols, data


def col_getter(cols, data):
    idx = {c["code"]: c["i"] for c in cols}

    def get(row, code):
        i = idx.get(code)
        return (row[i].strip() if i is not None and i < len(row) else "")

    return get, idx


def is_question_col(code):
    return bool(re.match(r"Q\d", code))


# ---------------------------------------------------------------------------
def build_questions(cols, data):
    """Question dictionary -> list of documents (1 document = 1 logical question)."""
    # observed values per Qcode (used for type + options)
    qcols = [c for c in cols if is_question_col(c["code"])]
    observed = {c["code"]: set() for c in qcols}
    for row in data:
        for c in qcols:
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                observed[c["code"]].add(v)

    groups = {}  # base -> dict
    order_seen = []
    for c in qcols:
        base = qbase(c["code"])
        if base not in groups:
            groups[base] = {
                "_id": base,
                "order": c["i"],
                "qnum": qnum(c["code"]),
                "section": SECTION_BY_QNUM.get(qnum(c["code"]), "Other"),
                "qids": [],
                "text": split_text(c["text"])[0],
                "items": [],
                "_obs": set(),
                "_types": [],
            }
            order_seen.append(base)
        g = groups[base]
        base_qid = re.match(r"(QID\d+)", c["qid"] or "")
        if base_qid and base_qid.group(1) not in g["qids"]:
            g["qids"].append(base_qid.group(1))
        stem, label = split_text(c["text"])
        key = sanitize_key(c["code"])
        item = {"key": key, "qcode": c["code"], "qid": c["qid"]}
        if label:
            item["label"] = label
        g["items"].append(item)
        g["_obs"] |= observed[c["code"]]
        g["_types"].append(detect_type(c["code"], observed[c["code"]]))

    out = []
    for n, base in enumerate(order_seen):
        g = groups[base]
        obs = sorted(g.pop("_obs"))
        types = g.pop("_types")
        # group type: the most frequent one; sorting makes ties deterministic across runs
        # (iterating a bare set of strings depends on hash randomization)
        gtype = max(sorted(set(types)), key=types.count) if types else "single_or_text"
        g["type"] = gtype
        # options only for categorical questions (yesno/single)
        if gtype in ("yesno", "matrix_yesno"):
            g["options"] = ["Yes", "No"]
        elif gtype == "single_or_text" and obs and len(obs) <= 12:
            g["options"] = obs
        else:
            g["options"] = []
        if base in STEM_OVERRIDE:
            g["text"] = STEM_OVERRIDE[base]
        g["order"] = n  # renumber 0..N by order in the CSV
        # a single item without a label is a plain question: leave items empty
        if len(g["items"]) == 1 and "label" not in g["items"][0]:
            g["items"] = []
        out.append(g)
    return out


# ---------------------------------------------------------------------------
def build_response(cols, get, row, source_file):
rid = get(row, "ResponseId")
answers = {}
for c in cols:
if is_question_col(c["code"]):
v = (row[c["i"]].strip() if c["i"] < len(row) else "")
if v:
answers[sanitize_key(c["code"])] = v
def g(*names):
for nm in names:
v = get(row, nm)
if v:
return v
return None
meta = {
"start_date": get(row, "StartDate") or None,
"end_date": get(row, "EndDate") or None,
"recorded_date": get(row, "RecordedDate") or None,
"status": get(row, "Status") or None,
"progress": int(get(row, "Progress")) if get(row, "Progress").isdigit() else get(row, "Progress") or None,
"finished": get(row, "Finished") in ("True", "1", "TRUE"),
"duration_sec": int(get(row, "Duration (in seconds)")) if get(row, "Duration (in seconds)").isdigit() else None,
"user_language": get(row, "UserLanguage") or None,
"distribution_channel": get(row, "DistributionChannel") or None,
"ip_address": get(row, "IPAddress") or None,
"location_lat": get(row, "LocationLatitude") or None,
"location_lng": get(row, "LocationLongitude") or None,
"survey_date": get(row, "Date") or None,
"survey_time": get(row, "Time") or None,
}
doc = {
"_id": rid,
"study": "77242113UCO3002",
"site_country": get(row, "site_country") or None,
"site_name": get(row, "site_name") or None,
"site_city": get(row, "site_city") or None,
"site_state": get(row, "site_state") or None,
"site_postcode": get(row, "site_postcode") or None,
"site_address": get(row, "site_address") or None,
"pi_first_name": get(row, "pi_first_name") or None,
"pi_last_name": get(row, "pi_last_name") or None,
"pi_email": (get(row, "pi_email") or "").lower() or None,
"pi_phone": get(row, "pi_phone") or None,
"sdl_site_id": get(row, "sdl_site_id") or None,
"fire_site_id": get(row, "fire_site_id") or None,
"fire_investigator_id": get(row, "fire_investigator_id") or None,
"mailinglist_id": get(row, "mailinglist_id") or None,
"survey_generated_by": get(row, "survey_generated_by") or None,
"recipient_email": (get(row, "RecipientEmail") or "").lower() or None,
"recipient_last_name": get(row, "RecipientLastName") or None,
"recipient_first_name": get(row, "RecipientFirstName") or None,
"meta": meta,
"is_full_sipiq": any(k.startswith(("Q57", "Q58", "Q59", "Q63", "Q66", "Q71")) for k in answers),
"interested": answers.get("Q25"),
"answers": answers,
"investigator_oid": None,
"investigator_match": None,
"source_file": source_file,
}
return doc
def content_hash(doc):
payload = {k: doc[k] for k in doc if k not in
("content_sha256", "first_imported_at", "last_seen_at", "last_updated_at", "history",
"investigator_oid", "investigator_match", "source_file")}
blob = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
return hashlib.sha256(blob.encode("utf-8")).hexdigest()
# ---------------------------------------------------------------------------
def load_investigators(db):
inv = list(db.investigators.find(
{"zeme": {"$in": ["Czech Republic", "Slovakia"]}},
{"prijmeni": 1, "jmeno": 1, "email": 1, "email2": 1, "zeme": 1, "KROK": 1, "pracoviste": 1},
))
by_email = {}
by_name = {}
for d in inv:
for ef in ("email", "email2"):
e = (d.get(ef) or "").lower().strip()
if e:
by_email.setdefault(e, d)
nm = norm_name(d.get("prijmeni"))
if nm:
by_name.setdefault((nm, d.get("zeme")), []).append(d)
return inv, by_email, by_name
def soft_link(doc, by_email, by_name):
e = (doc.get("pi_email") or "").lower().strip()
if e and e in by_email:
d = by_email[e]
return d["_id"], f"email:{e}", d
e2 = (doc.get("recipient_email") or "").lower().strip()
if e2 and e2 in by_email:
d = by_email[e2]
return d["_id"], f"recipient_email:{e2}", d
nm = norm_name(doc.get("pi_last_name"))
cand = by_name.get((nm, doc.get("site_country")), [])
if len(cand) == 1:
return cand[0]["_id"], f"prijmeni:{nm}", cand[0]
if len(cand) > 1:
return None, f"prijmeni_ambiguous:{nm}({len(cand)})", None
return None, "NENALEZENO", None
# ---------------------------------------------------------------------------
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--csv", required=True)
    ap.add_argument("--scope", choices=["czsk", "all"], default="czsk")
    ap.add_argument("--apply", action="store_true", help="write to MongoDB (otherwise dry run)")
    ap.add_argument("--dry-run", action="store_true", help="report only; wins over --apply")
    args = ap.parse_args()
    # Dry run is the default, and an explicit --dry-run overrides --apply.
    dry = args.dry_run or not args.apply
    source_file = args.csv.replace("\\", "/").split("/")[-1]

    # Load the CSV once: the question dictionary uses ALL rows, responses only the scope.
    cols, data_all = load_csv(args.csv)
    get, _ = col_getter(cols, data_all)
    data = data_all
    if args.scope == "czsk":
        data = [r for r in data_all if get(r, "site_country") in ("Czech Republic", "Slovakia")]
    print(f"Source: {source_file} | scope={args.scope} | responses to import: {len(data)}")

    # --- question dictionary (built from the FULL CSV, not just the scope) ---
    questions = build_questions(cols, data_all)
    print(f"Question dictionary: {len(questions)} logical questions "
          f"({sum(1 for q in questions if q['items'])} of them multi-part).")

    # --- Mongo ---
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
    db = client[DB_NAME]
    client.admin.command("ping")
    inv, by_email, by_name = load_investigators(db)
    print(f"CZ+SK investigators in DB: {len(inv)}")

    # --- response documents + soft-link ---
    docs = []
    link_rows = []
    for r in data:
        doc = build_response(cols, get, r, source_file)
        oid, how, matched = soft_link(doc, by_email, by_name)
        doc["investigator_oid"] = oid
        doc["investigator_match"] = how
        doc["content_sha256"] = content_hash(doc)
        docs.append(doc)
        link_rows.append((doc, how, matched))

    # --- delta against the DB ---
    existing = {d["_id"]: d for d in db[COL_R].find({}, {"content_sha256": 1})}
    to_insert = [d for d in docs if d["_id"] not in existing]
    to_update, unchanged = [], []
    for d in docs:
        if d["_id"] in existing:
            if existing[d["_id"]].get("content_sha256") != d["content_sha256"]:
                to_update.append(d)
            else:
                unchanged.append(d)

    # ===================== REPORT =====================
    print("\n=== SOFT-LINK to investigators ===")
    matched_k7 = matched_other = unmatched = 0
    for doc, how, m in link_rows:
        krok = (m or {}).get("KROK", "")
        # The original marker glyphs were lost; plain ASCII markers are used instead.
        tag = "OK" if m else "--"
        if m and str(krok).startswith("7"):
            matched_k7 += 1
        elif m:
            matched_other += 1
        else:
            unmatched += 1
        # site_country can be None (e.g. with --scope all), so fall back before slicing.
        print(f"  {tag} {(doc.get('site_country') or '?')[:2]} {str(doc.get('pi_last_name'))[:18]:18} "
              f"{str(doc.get('pi_email'))[:32]:32} -> {how[:40]:40} {('KROK ' + str(krok)) if m else ''}")
    print(f"  Summary: matched KROK7={matched_k7}, other KROK={matched_other}, unmatched={unmatched}")

    print("\n=== DELTA ===")
    print(f"  INSERT (new):     {len(to_insert)}")
    print(f"  UPDATE (changed): {len(to_update)}")
    print(f"  unchanged:        {len(unchanged)}")

    # sample of one document
    if docs:
        s = dict(docs[0])
        shown = list(s["answers"])[:6]
        hidden = len(s["answers"]) - len(shown)
        s["answers"] = {k: s["answers"][k] for k in shown}
        if hidden > 0:
            s["answers"]["..."] = f"(+{hidden} more)"
        print("\n=== SAMPLE response document (truncated) ===")
        print(json.dumps(s, ensure_ascii=False, indent=2, default=str)[:1800])

    if dry:
        print("\n[DRY-RUN] Nothing was written. For a real run add --apply")
        client.close()
        return

    # ===================== WRITE =====================
    # 1) question dictionary (idempotent upsert)
    nq = 0
    for q in questions:
        db[COL_Q].replace_one({"_id": q["_id"]}, q, upsert=True)
        nq += 1
    print(f"\n[APPLY] sipiq_questions: upserted {nq}")

    # 2) responses (delta)
    ts = now_iso()
    ni = nu = ns = 0
    for d in docs:
        cur = db[COL_R].find_one({"_id": d["_id"]})
        if cur is None:
            d["first_imported_at"] = ts
            d["last_seen_at"] = ts
            d["last_updated_at"] = ts
            d["history"] = []
            db[COL_R].insert_one(d)
            ni += 1
        elif cur.get("content_sha256") != d["content_sha256"]:
            changes = diff_docs(cur, d)
            db[COL_R].update_one({"_id": d["_id"]}, {
                "$set": {**{k: d[k] for k in d if k not in ("_id",)},
                         "last_seen_at": ts, "last_updated_at": ts},
                "$push": {"history": {"changed_at": ts, "source_file": source_file, "changes": changes}},
            })
            nu += 1
        else:
            db[COL_R].update_one({"_id": d["_id"]},
                                 {"$set": {"last_seen_at": ts, "source_file": source_file}})
            ns += 1
    print(f"[APPLY] sipiq_responses: insert={ni}, update={nu}, unchanged={ns}")
    client.close()


def diff_docs(old, new):
    """Field-level diff for history (answers + meta + selected promoted fields only)."""
    changes = []

    def walk(prefix, o, n):
        keys = set((o or {}).keys()) | set((n or {}).keys())
        for k in sorted(keys):
            ov, nv = (o or {}).get(k), (n or {}).get(k)
            if isinstance(ov, dict) or isinstance(nv, dict):
                walk(f"{prefix}{k}.", ov or {}, nv or {})
            elif ov != nv:
                changes.append({"key": f"{prefix}{k}", "old": ov, "new": nv})

    for field in ("answers", "meta"):
        walk(f"{field}.", old.get(field, {}), new.get(field, {}))
    for k in ("site_name", "pi_email", "pi_last_name", "interested", "is_full_sipiq"):
        if old.get(k) != new.get(k):
            changes.append({"key": k, "old": old.get(k), "new": new.get(k)})
    return changes


if __name__ == "__main__":
    main()
+47
@@ -0,0 +1,47 @@
# sipiq_import_v1.1: import of SIPIQ responses into MongoDB (folder workflow)
**Version:** 1.1 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)
## Changes since v1.0
- **FOLDER WORKFLOW** (`--folder`): picks up every `*.csv` in the folder, imports it (delta),
  and after successful processing **moves the file into the `Zpracováno` subfolder** (Czech for "Processed").
  Default folder = `U:\PythonProject\Janssen\Feasibility\77242113UCO2001\ImportSIPIQcompled`.
  Incoming/Processed pattern (as for IWRS / Panorama). The old v1.0 → `Feasibility\TRASH`.
## Purpose and collections
(same as v1.0) Import of the Qualtrics CSV export into db `feasibility`:
- `sipiq_questions`: questionnaire dictionary (reconstruction of the SIPIQ as in the PDF).
- `sipiq_responses`: 1 document = 1 response (`_id`=ResponseId), flat `answers{}`,
  soft-link `investigator_oid`, delta + `history[]`.

Source = CSV (row 1 Qcode, row 2 question text, row 3 ImportId=QID). The XML does not contain the question text.
## Delta import (rewrites ONLY changed data)
new → INSERT; unchanged (same `content_sha256`) → only `last_seen_at`;
changed → `$set` of only the changed fields + `$push` to `history[]`.
## Soft-link to investigators (non-destructive)
pi_email → email/email2 (lowercase), then recipient_email, fallback: surname (accents stripped) + country.
## Usage
```
# folder mode (default folder): processes everything and moves it to Zpracováno
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --dry-run
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --apply
# a different folder
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --folder "<cesta>" --apply
# a single file (is NOT moved)
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --csv "<cesta.csv>" --apply
```
`--scope czsk` (default) / `all`. Default = dry run, real run = `--apply`.
The move to `Zpracováno` happens ONLY with `--apply` and ONLY in folder mode (not with `--csv`).
Name collisions in `Zpracováno` → suffix `_N`.
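
The collision rule can be sketched as below. This is only an illustration of the `_N` idea: the helper name `move_to_processed`, the numbering starting at `_1`, and the placement of the suffix before the extension are assumptions, not taken from the v1.1 source.

```python
import os
import shutil


def move_to_processed(path, processed_subdir="Zpracováno"):
    """Move a processed file into a subfolder next to it, never overwriting."""
    folder = os.path.dirname(os.path.abspath(path))
    target_dir = os.path.join(folder, processed_subdir)
    os.makedirs(target_dir, exist_ok=True)

    stem, ext = os.path.splitext(os.path.basename(path))
    target = os.path.join(target_dir, stem + ext)
    n = 1
    # On a name collision try export_1.csv, export_2.csv, ... (assumed numbering)
    while os.path.exists(target):
        target = os.path.join(target_dir, f"{stem}_{n}{ext}")
        n += 1

    shutil.move(path, target)
    return target
```

Keeping every earlier export under its own name means a re-delivered file never silently replaces the copy that was already imported.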
## Workflow (agreed 17JUN2026)
The user drops complete SIPIQ reports (Qualtrics CSV) into `ImportSIPIQcompled\`.
After processing, the script moves the file to `ImportSIPIQcompled\Zpracováno\`. The delta logic ensures
that a repeated or extended export only adds new or changed responses (the rest stays untouched).
## Status as of 17JUN2026
Folder + `Zpracováno` are ready. The initial import (15 CZ+SK responses from the 06.06 export) was still done with v1.0:
`sipiq_questions`: 56, `sipiq_responses`: 15, soft-link 15/15 via e-mail = KROK 7.
+480
@@ -0,0 +1,480 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
sipiq_import_v1.1.py
====================
Version: 1.1
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)

Changes since v1.0
------------------
- FOLDER WORKFLOW: the --folder mode picks up every *.csv in the folder, imports it (delta),
  and after successful processing moves the file into the `Zpracováno` subfolder. Default folder =
  U:\\PythonProject\\Janssen\\Feasibility\\77242113UCO2001\\ImportSIPIQcompled.
  (Incoming/Processed pattern as for IWRS / Panorama.) The old v1.0 is kept in TRASH.

Description
-----------
Import of SIPIQ responses (Qualtrics CSV export, study 77242113UCO3002 / ICONIC DAWN)
into the MongoDB database `feasibility`. Two collections:
  * sipiq_questions   questionnaire dictionary (1 document = 1 logical question).
  * sipiq_responses   1 document = 1 response (_id = Qualtrics ResponseId), flat answers{},
                      soft-link investigator_oid, delta bookkeeping + history[].
DELTA import (rewrites ONLY changed data): new->insert; unchanged->only last_seen_at;
changed->$set of only the changed fields + push to history[].

Usage
-----
  # folder mode (default folder): processes everything and moves it to Zpracováno
python sipiq_import_v1.1.py --dry-run
python sipiq_import_v1.1.py --apply
  # a specific folder
python sipiq_import_v1.1.py --folder "<cesta>" --apply
  # a single file (is NOT moved)
python sipiq_import_v1.1.py --csv "<cesta.csv>" --apply
Dependencies: pymongo (.venv). Mongo 192.168.1.76:27017, no auth.
"""
import argparse
import csv
import glob
import hashlib
import json
import os
import re
import shutil
import sys
import unicodedata
from datetime import datetime, timezone
try:
from pymongo import MongoClient
except ImportError:
print("CHYBA: pymongo není nainstalován v aktuálním pythonu.", file=sys.stderr)
raise
MONGO_URI = "mongodb://192.168.1.76:27017"
DB_NAME = "feasibility"
COL_Q = "sipiq_questions"
COL_R = "sipiq_responses"
DEFAULT_FOLDER = r"U:\PythonProject\Janssen\Feasibility\77242113UCO2001\ImportSIPIQcompled"
PROCESSED_SUBDIR = "Zpracováno"
META_COLS = {
"StartDate", "EndDate", "Status", "IPAddress", "Progress", "Duration (in seconds)",
"Finished", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName",
"RecipientEmail", "ExternalReference", "LocationLatitude", "LocationLongitude",
"DistributionChannel", "UserLanguage",
}
PROMOTE = [
"site_name", "site_address", "site_city", "site_state", "site_postcode", "site_country",
"pi_first_name", "pi_last_name", "pi_phone", "pi_email",
"sdl_site_id", "fire_site_id", "fire_investigator_id", "mailinglist_id",
"survey_generated_by", "Date", "Time",
]
SECTION_BY_QNUM = {}
def _sec(rng, name):
for n in rng:
SECTION_BY_QNUM[n] = name
_sec([2], "J&J Internal Assessment")
_sec([6, 7, 8, 9, 10, 11, 12, 13], "Contact Information")
_sec(range(14, 22), "Confidentiality Statement")
_sec([25, 26, 27], "Interest")
_sec([29, 30, 31, 32, 33, 34], "Protocol Requirements")
_sec([36, 37, 38], "Enrollment")
_sec([40, 41, 42, 43], "Patient Demographics Overview")
_sec([45, 46, 47, 48, 49], "Site Overview")
_sec([51], "Operational Considerations")
_sec([53, 54], "Comments")
_sec([57, 58, 59, 60, 61], "Patient Population")
_sec([63, 64, 65, 66, 67], "Site Experience and Staffing")
_sec([69], "Equipment and Facility Requirements")
_sec([71, 72, 73, 74, 75], "Institutional Review Board, Ethics Committee, and Contracts")
STEM_OVERRIDE = {
"Q31": "At your site, at what line(s) of treatment do you most commonly prescribe "
"vedolizumab for patients with moderately to severely active ulcerative colitis?",
"Q63": "Do you or your site staff have experience in performing the following types of "
"study assessments/procedures?",
"Q64": "The following personnel are required to run the study. "
"Will your site have the following available?",
"Q69": "The following equipment and facilities are required to run the studies. "
"Are these available at your site?",
}
def now_iso():
return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")
def strip_accents(s):
if not s:
return ""
return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))
def norm_name(s):
return re.sub(r"\s+", " ", strip_accents(s or "").lower()).strip()
def sanitize_key(qcode):
return qcode.replace("#", "_").replace(".", "_")
def qnum(qcode):
m = re.match(r"Q(\d+)", qcode)
return int(m.group(1)) if m else None
def qbase(qcode):
m = re.match(r"(Q\d+)", qcode)
return m.group(1) if m else qcode
def import_id(h3_cell):
try:
return json.loads(h3_cell).get("ImportId", "")
except Exception:
return h3_cell
def split_text(text):
parts = [p.strip() for p in re.split(r"\s+-\s+", text)]
stem = parts[0]
if len(parts) == 1:
return stem, None
label_parts = [p for p in parts[1:] if p.lower() != "selected choice"]
label_parts = [p for p in label_parts if not re.fullmatch(r"Q\d+#\d+", p)]
return stem, (" - ".join(label_parts) if label_parts else None)
def detect_type(qcode, observed):
has_hash = "#" in qcode
vals = [v for v in observed if v]
yesno = vals and all(v in ("Yes", "No") for v in vals)
numeric = vals and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals)
if has_hash and yesno:
return "matrix_yesno"
if has_hash and numeric:
return "matrix_percent"
if has_hash:
return "matrix"
if numeric:
return "numeric"
if yesno:
return "yesno"
return "single_or_text"
def load_csv(path):
with open(path, encoding="utf-8-sig", newline="") as fh:
rows = list(csv.reader(fh))
h1, h2, h3 = rows[0], rows[1], rows[2]
data = rows[3:]
cols = [{"i": i, "code": c, "text": t, "qid": import_id(j)}
for i, (c, t, j) in enumerate(zip(h1, h2, h3))]
return cols, data
def col_getter(cols, data):
idx = {c["code"]: c["i"] for c in cols}
def get(row, code):
i = idx.get(code)
return (row[i].strip() if i is not None and i < len(row) else "")
return get, idx
def is_question_col(code):
return bool(re.match(r"Q\d", code))
def build_questions(cols, data):
qcols = [c for c in cols if is_question_col(c["code"])]
observed = {c["code"]: set() for c in qcols}
for row in data:
for c in qcols:
v = (row[c["i"]].strip() if c["i"] < len(row) else "")
if v:
observed[c["code"]].add(v)
groups, order_seen = {}, []
for c in qcols:
base = qbase(c["code"])
if base not in groups:
groups[base] = {"_id": base, "order": c["i"], "qnum": qnum(c["code"]),
"section": SECTION_BY_QNUM.get(qnum(c["code"]), "Other"),
"qids": [], "text": split_text(c["text"])[0],
"items": [], "_obs": set(), "_types": []}
order_seen.append(base)
g = groups[base]
bq = re.match(r"(QID\d+)", c["qid"] or "")
if bq and bq.group(1) not in g["qids"]:
g["qids"].append(bq.group(1))
_, label = split_text(c["text"])
item = {"key": sanitize_key(c["code"]), "qcode": c["code"], "qid": c["qid"]}
if label:
item["label"] = label
g["items"].append(item)
g["_obs"] |= observed[c["code"]]
g["_types"].append(detect_type(c["code"], observed[c["code"]]))
out = []
for n, base in enumerate(order_seen):
g = groups[base]
obs = sorted(g.pop("_obs"))
types = g.pop("_types")
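        # sorted() makes ties deterministic across runs (bare set order depends on hash randomization)
        gtype = max(sorted(set(types)), key=types.count) if types else "single_or_text"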
g["type"] = gtype
if gtype in ("yesno", "matrix_yesno"):
g["options"] = ["Yes", "No"]
elif gtype == "single_or_text" and obs and len(obs) <= 12:
g["options"] = obs
else:
g["options"] = []
if base in STEM_OVERRIDE:
g["text"] = STEM_OVERRIDE[base]
g["order"] = n
if len(g["items"]) == 1 and "label" not in g["items"][0]:
g["items"] = []
out.append(g)
return out
def build_response(cols, get, row, source_file):
rid = get(row, "ResponseId")
answers = {}
for c in cols:
if is_question_col(c["code"]):
v = (row[c["i"]].strip() if c["i"] < len(row) else "")
if v:
answers[sanitize_key(c["code"])] = v
meta = {
"start_date": get(row, "StartDate") or None,
"end_date": get(row, "EndDate") or None,
"recorded_date": get(row, "RecordedDate") or None,
"status": get(row, "Status") or None,
"progress": int(get(row, "Progress")) if get(row, "Progress").isdigit() else (get(row, "Progress") or None),
"finished": get(row, "Finished") in ("True", "1", "TRUE"),
"duration_sec": int(get(row, "Duration (in seconds)")) if get(row, "Duration (in seconds)").isdigit() else None,
"user_language": get(row, "UserLanguage") or None,
"distribution_channel": get(row, "DistributionChannel") or None,
"ip_address": get(row, "IPAddress") or None,
"location_lat": get(row, "LocationLatitude") or None,
"location_lng": get(row, "LocationLongitude") or None,
"survey_date": get(row, "Date") or None,
"survey_time": get(row, "Time") or None,
}
doc = {
"_id": rid, "study": "77242113UCO3002",
"site_country": get(row, "site_country") or None,
"site_name": get(row, "site_name") or None,
"site_city": get(row, "site_city") or None,
"site_state": get(row, "site_state") or None,
"site_postcode": get(row, "site_postcode") or None,
"site_address": get(row, "site_address") or None,
"pi_first_name": get(row, "pi_first_name") or None,
"pi_last_name": get(row, "pi_last_name") or None,
"pi_email": (get(row, "pi_email") or "").lower() or None,
"pi_phone": get(row, "pi_phone") or None,
"sdl_site_id": get(row, "sdl_site_id") or None,
"fire_site_id": get(row, "fire_site_id") or None,
"fire_investigator_id": get(row, "fire_investigator_id") or None,
"mailinglist_id": get(row, "mailinglist_id") or None,
"survey_generated_by": get(row, "survey_generated_by") or None,
"recipient_email": (get(row, "RecipientEmail") or "").lower() or None,
"recipient_last_name": get(row, "RecipientLastName") or None,
"recipient_first_name": get(row, "RecipientFirstName") or None,
"meta": meta,
"is_full_sipiq": any(k.startswith(("Q57", "Q58", "Q59", "Q63", "Q66", "Q71")) for k in answers),
"interested": answers.get("Q25"),
"answers": answers,
"investigator_oid": None, "investigator_match": None,
"source_file": source_file,
}
return doc
def content_hash(doc):
payload = {k: doc[k] for k in doc if k not in
("content_sha256", "first_imported_at", "last_seen_at", "last_updated_at",
"history", "investigator_oid", "investigator_match", "source_file")}
return hashlib.sha256(json.dumps(payload, sort_keys=True, ensure_ascii=False,
default=str).encode("utf-8")).hexdigest()
def load_investigators(db):
inv = list(db.investigators.find(
{"zeme": {"$in": ["Czech Republic", "Slovakia"]}},
{"prijmeni": 1, "jmeno": 1, "email": 1, "email2": 1, "zeme": 1, "KROK": 1}))
by_email, by_name = {}, {}
for d in inv:
for ef in ("email", "email2"):
e = (d.get(ef) or "").lower().strip()
if e:
by_email.setdefault(e, d)
nm = norm_name(d.get("prijmeni"))
if nm:
by_name.setdefault((nm, d.get("zeme")), []).append(d)
return inv, by_email, by_name
def soft_link(doc, by_email, by_name):
e = (doc.get("pi_email") or "").lower().strip()
if e and e in by_email:
d = by_email[e]; return d["_id"], f"email:{e}", d
e2 = (doc.get("recipient_email") or "").lower().strip()
if e2 and e2 in by_email:
d = by_email[e2]; return d["_id"], f"recipient_email:{e2}", d
nm = norm_name(doc.get("pi_last_name"))
cand = by_name.get((nm, doc.get("site_country")), [])
if len(cand) == 1:
return cand[0]["_id"], f"prijmeni:{nm}", cand[0]
if len(cand) > 1:
return None, f"prijmeni_ambiguous:{nm}({len(cand)})", None
return None, "NENALEZENO", None
def diff_docs(old, new):
changes = []
def walk(prefix, o, n):
for k in sorted(set((o or {}).keys()) | set((n or {}).keys())):
ov, nv = (o or {}).get(k), (n or {}).get(k)
if isinstance(ov, dict) or isinstance(nv, dict):
walk(f"{prefix}{k}.", ov or {}, nv or {})
elif ov != nv:
changes.append({"key": f"{prefix}{k}", "old": ov, "new": nv})
for field in ("answers", "meta"):
walk(f"{field}.", old.get(field, {}), new.get(field, {}))
for k in ("site_name", "pi_email", "pi_last_name", "interested", "is_full_sipiq"):
if old.get(k) != new.get(k):
changes.append({"key": k, "old": old.get(k), "new": new.get(k)})
return changes
# ---------------------------------------------------------------------------
def process_file(db, csv_path, scope, dry, by_email, by_name):
source_file = os.path.basename(csv_path)
cols, data = load_csv(csv_path)
get, _ = col_getter(cols, data)
if scope == "czsk":
data = [r for r in data if get(r, "site_country") in ("Czech Republic", "Slovakia")]
print(f"\n########## {source_file} (rozsah={scope}, odpovědí={len(data)}) ##########")
# slovník z plného CSV
cols_all, data_all = load_csv(csv_path)
questions = build_questions(cols_all, data_all)
docs, link_rows = [], []
for r in data:
doc = build_response(cols, get, r, source_file)
oid, how, matched = soft_link(doc, by_email, by_name)
doc["investigator_oid"] = oid
doc["investigator_match"] = how
doc["content_sha256"] = content_hash(doc)
docs.append(doc)
link_rows.append((doc, how, matched))
existing = {d["_id"]: d for d in db[COL_R].find({}, {"content_sha256": 1})}
to_insert = [d for d in docs if d["_id"] not in existing]
to_update = [d for d in docs if d["_id"] in existing and existing[d["_id"]].get("content_sha256") != d["content_sha256"]]
unchanged = [d for d in docs if d["_id"] in existing and existing[d["_id"]].get("content_sha256") == d["content_sha256"]]
mk7 = mko = un = 0
for doc, how, m in link_rows:
krok = (m or {}).get("KROK", "")
if m and str(krok).startswith("7"): mk7 += 1
elif m: mko += 1
else: un += 1
print(f" slovník: {len(questions)} otázek | soft-link: KROK7={mk7}, jiný={mko}, nenapárováno={un}")
print(f" delta: INSERT={len(to_insert)}, UPDATE={len(to_update)}, beze změny={len(unchanged)}")
if un:
for doc, how, m in link_rows:
if not m:
print(f" ✗ NENAPÁROVÁNO: {doc.get('pi_last_name')} / {doc.get('pi_email')} ({how})")
if dry:
print(" [DRY-RUN] nezapsáno")
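# Report the planned counts so the dry-run grand total is informative; nothing is written.
return {"insert": len(to_insert), "update": len(to_update), "unchanged": len(unchanged), "wrote": False}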
for q in questions:
db[COL_Q].replace_one({"_id": q["_id"]}, q, upsert=True)
ts = now_iso()
ni = nu = ns = 0
for d in docs:
cur = db[COL_R].find_one({"_id": d["_id"]})
if cur is None:
d.update({"first_imported_at": ts, "last_seen_at": ts, "last_updated_at": ts, "history": []})
db[COL_R].insert_one(d); ni += 1
elif cur.get("content_sha256") != d["content_sha256"]:
changes = diff_docs(cur, d)
db[COL_R].update_one({"_id": d["_id"]}, {
"$set": {**{k: d[k] for k in d if k != "_id"}, "last_seen_at": ts, "last_updated_at": ts},
"$push": {"history": {"changed_at": ts, "source_file": source_file, "changes": changes}}})
nu += 1
else:
db[COL_R].update_one({"_id": d["_id"]}, {"$set": {"last_seen_at": ts, "source_file": source_file}})
ns += 1
print(f" [APPLY] questions upsert={len(questions)} | responses insert={ni}, update={nu}, beze změny={ns}")
return {"insert": ni, "update": nu, "unchanged": ns, "wrote": True}
def move_to_processed(csv_path, folder):
dest_dir = os.path.join(folder, PROCESSED_SUBDIR)
os.makedirs(dest_dir, exist_ok=True)
base = os.path.basename(csv_path)
dest = os.path.join(dest_dir, base)
if os.path.exists(dest): # kolize -> přípona _N
stem, ext = os.path.splitext(base)
n = 1
while os.path.exists(os.path.join(dest_dir, f"{stem}_{n}{ext}")):
n += 1
dest = os.path.join(dest_dir, f"{stem}_{n}{ext}")
shutil.move(csv_path, dest)
print(f" -> přesunuto do {PROCESSED_SUBDIR}\\{os.path.basename(dest)}")
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--csv", help="jediný soubor (NEpřesouvá)")
ap.add_argument("--folder", default=DEFAULT_FOLDER, help="složka se SIPIQ CSV (přesune do Zpracováno)")
ap.add_argument("--scope", choices=["czsk", "all"], default="czsk")
ap.add_argument("--apply", action="store_true")
ap.add_argument("--dry-run", action="store_true")
args = ap.parse_args()
dry = not args.apply
if args.csv:
files, move_mode, folder = [args.csv], False, None
else:
folder = args.folder
files = sorted(glob.glob(os.path.join(folder, "*.csv")))
move_mode = True
print(f"Složka: {folder}\nNalezeno CSV ke zpracování: {len(files)}")
if not files:
print("Nic ke zpracování (žádné *.csv).")
return
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
db = client[DB_NAME]
client.admin.command("ping")
inv, by_email, by_name = load_investigators(db)
print(f"Investigatorů CZ+SK v DB: {len(inv)}")
total = {"insert": 0, "update": 0, "unchanged": 0}
for f in files:
res = process_file(db, f, args.scope, dry, by_email, by_name)
for k in total:
total[k] += res[k]
if move_mode and res["wrote"]:
move_to_processed(f, folder)
print(f"\n=== CELKEM: insert={total['insert']}, update={total['update']}, beze změny={total['unchanged']} ===")
if dry:
print("[DRY-RUN] Nic se nezapsalo ani nepřesunulo. Ostrý běh: --apply")
client.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,32 @@
# analyze_sent_suspects_v1.0.py
**Version:** 1.0 · **Date:** 2026-06-16
Local (Z230) analyzer for `.msg` files transferred from JNJ (the output of
`jnj_scan_failed_sent`). It uses **olefile** to walk every `.msg` in a folder,
extracts the key MAPI properties from each one, and classifies whether it is an **unsent**
e-mail. Output: a console summary plus a timestamped `.xlsx`.
## Classification
- **FAIL_BODY**: the body/report contains "could not be sent" / "SendAsDenied" / …
- **SENDAS_BUZ**: send-account / SentRepresenting / Sender contains `buzalka.cz`
- **NO_MSGID**: the Internet Message-ID (0x1035) is missing
- `failed = ANO` ("yes") if FAIL_BODY or SENDAS_BUZ is set (almost certainly not sent); otherwise `failed = ?`.
It also extracts the **physician recipient** (an external address, not `its.jnj.com`), the subject,
the send-account, and the Message-ID. The date is taken from the file name (`..._YYYY-MM-DD_...`).
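The flag rules above can be sketched as a small pure function. This is a simplified illustration rather than the script itself: the helper name `classify` and its plain-string arguments are assumptions, while the marker strings and flag names come from the description above.

```python
# Illustrative sketch of the flag rules; `classify` is a hypothetical helper.
FAIL_SIGNS = ("could not be sent", "sendasdenied")  # subset of the markers named above


def classify(body, sender_fields, msgid):
    """Return (flags, failed) for one message, given already-extracted strings."""
    flags = []
    if any(sign in body.lower() for sign in FAIL_SIGNS):
        flags.append("FAIL_BODY")
    if "buzalka.cz" in " ".join(sender_fields).lower():
        flags.append("SENDAS_BUZ")
    if not msgid:
        flags.append("NO_MSGID")
    # A missing Message-ID alone is only a weak hint, so it never sets failed=ANO.
    failed = "ANO" if ("FAIL_BODY" in flags or "SENDAS_BUZ" in flags) else "?"
    return flags, failed
```

Keeping the rules free of any `.msg` parsing makes them easy to check against hand-written strings before running the analyzer on real files.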
## Running
```
python analyze_sent_suspects_v1.0.py [SLOZKA_S_MSG]
```
- Without an argument it uses `INPUT_DIR` (default
`U:\Dropbox\!!!Days\Downloads Z230\sent_suspects`).
- The `.xlsx` is saved to `U:\Dropbox\!!!Days\Downloads Z230\`.
- Requires `olefile` + `openpyxl` (both are in the venv `U:\janssen\.venv`).
## After the analysis (next step)
The list of recipients with `failed=ANO` = physicians who **never received the initial offer**.
A cross-reference against `feasibility.investigators` shows who needs the offer re-sent
(and at which KROK), **with the correct From `vbuzalka@its.jnj.com`**.
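A minimal sketch of that cross-reference, assuming the investigator records have already been loaded as dictionaries with the `email`, `email2`, `prijmeni` and `KROK` fields that the import script in this commit reads. The helper name `resend_list` and the shape of its result are illustrative only.

```python
def resend_list(rows, investigators):
    """Match failed recipients to investigators by e-mail (case-insensitive)."""
    by_email = {}
    for inv in investigators:
        for field in ("email", "email2"):
            addr = (inv.get(field) or "").strip().lower()
            if addr:
                by_email.setdefault(addr, inv)  # first record wins on duplicates
    out = []
    for row in rows:
        if row.get("failed") != "ANO":
            continue  # only messages classified as certainly unsent
        addr = (row.get("recipient") or "").strip().lower()
        inv = by_email.get(addr)
        out.append({
            "recipient": addr,
            "prijmeni": inv.get("prijmeni") if inv else None,
            "KROK": inv.get("KROK") if inv else None,
            "matched": inv is not None,
        })
    return out
```

Unmatched recipients are kept in the result with `matched=False`, so addresses that are missing from the investigator list stay visible instead of silently dropping out.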
"""
@@ -0,0 +1,196 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name: analyze_sent_suspects_v1.0.py
# Version: 1.0
# Date: 2026-06-16
# Purpose: LOCAL (Z230) analyzer for .msg files transferred from JNJ (output of
# jnj_scan_failed_sent). Uses olefile to read the key MAPI properties
# of each .msg and classifies whether it is an UNSENT e-mail:
# FAIL_BODY = body/report contains "could not be sent"/"SendAsDenied"
# SENDAS_BUZ = send-account / sentrep / sender contains "buzalka.cz"
# NO_MSGID = Internet Message-ID (0x1035) is missing
# Extracts the recipient (external = physician), subject, send-account, Message-ID.
# Output: console summary + timestamped .xlsx.
# Usage: python analyze_sent_suspects_v1.0.py [SLOZKA_S_MSG]
# (default INPUT_DIR below). Requires olefile + openpyxl.
# =============================================================================
import os
import re
import sys
import glob
import datetime
import olefile
import openpyxl
INPUT_DIR = r"U:\Dropbox\!!!Days\Downloads Z230\sent_suspects"
OUT_DIR = r"U:\Dropbox\!!!Days\Downloads Z230"
FAIL_SIGNS = [
"could not be sent", "sendasdenied",
"permission to send the message on behalf",
"transportsend operation has failed", "mapiexceptionsendasdenied",
]
INTERNAL = ("its.jnj.com",) # interni = ne-lekar (vc. cc Kocourkova/Bartosova)
def rd(o, tag):
    """Read the string stream __substg1.0_<tag> (tries 001F Unicode, then 001E ANSI)."""
    for t in (tag, tag[:-1] + "F", tag[:-1] + "E"):
        name = "__substg1.0_" + t
        if o.exists(name):
            b = o.openstream(name).read()
            if t.endswith("001F"):
                # PT_UNICODE streams are UTF-16 little-endian
                try:
                    return b.decode("utf-16-le")
                except Exception:
                    pass
            # ANSI stream (or undecodable Unicode): try the likely code pages in turn
            for enc in ("cp1250", "latin-1", "utf-8"):
                try:
                    return b.decode(enc)
                except Exception:
                    pass
    return ""
def read_body(o):
txt = rd(o, "1000001F") # PR_BODY
if not txt:
txt = rd(o, "1001001F") # ReportText
# PR_HTML (binary) jako fallback
if not txt and o.exists("__substg1.0_10130102"):
try:
txt = o.openstream("__substg1.0_10130102").read().decode("latin-1", "ignore")
except Exception:
pass
return txt or ""
def recipients_smtp(o):
"""Posbira SMTP vsech prijemcu z __recip_version1.0_#xxxx storages."""
out = []
seen = set()
for entry in o.listdir():
# entry je list segmentu cesty; zajima nas prvni segment recip storage
if entry and entry[0].startswith("__recip_version1.0_#") and len(entry) == 2:
top = entry[0]
if top in seen:
continue
seen.add(top)
smtp = ""
for tag in ("39FE001F", "39FE001E", "3003001F", "3003001E", "0C1F001F"):
nm = top + "/__substg1.0_" + tag
if o.exists(nm):
b = o.openstream(nm).read()
try:
s = b.decode("utf-16-le") if tag.endswith("1F") else b.decode("cp1250")
except Exception:
s = b.decode("latin-1", "ignore")
s = s.strip()
if "@" in s:
smtp = s
break
if smtp:
out.append(smtp)
return out
def analyze_file(path):
o = olefile.OleFileIO(path)
try:
subject = rd(o, "0037001F")
msgid = rd(o, "1035001F")
sendacct = rd(o, "0E28001F")
sentrep = rd(o, "0065001F")
sender = rd(o, "0C1F001F")
body = read_body(o)
recs = recipients_smtp(o)
finally:
o.close()
low = body.lower()
flags = []
if any(s in low for s in FAIL_SIGNS):
flags.append("FAIL_BODY")
joined = " ".join([sendacct, sentrep, sender]).lower()
if "buzalka.cz" in joined:
flags.append("SENDAS_BUZ")
if not msgid:
flags.append("NO_MSGID")
# prijemce-lekar = externi (ne its.jnj.com)
ext = [r for r in recs if not any(d in r.lower() for d in INTERNAL)]
recipient = ext[0] if ext else (recs[0] if recs else "")
# datum z nazvu souboru (STRONG_YYYY-MM-DD_... / weak_YYYY-MM-DD_...)
m = re.search(r"(\d{4}-\d{2}-\d{2})", os.path.basename(path))
date = m.group(1) if m else ""
return {
"file": os.path.basename(path),
"date": date,
"recipient": recipient,
"subject": subject.strip(),
"msgid": msgid.strip(),
"send_account": sendacct.strip(),
"sentrep": sentrep.strip(),
"flags": "+".join(flags),
"failed": "ANO" if ("FAIL_BODY" in flags or "SENDAS_BUZ" in flags) else "?",
}
def main():
indir = sys.argv[1] if len(sys.argv) > 1 else INPUT_DIR
files = sorted(glob.glob(os.path.join(indir, "*.msg")))
if not files:
print("Zadne .msg v:", indir)
return
rows = []
for f in files:
try:
rows.append(analyze_file(f))
except Exception as e:
rows.append({"file": os.path.basename(f), "date": "", "recipient": "",
"subject": "<chyba cteni>", "msgid": "", "send_account": "",
"sentrep": "", "flags": "ERR:" + str(e), "failed": "?"})
# serad: nejdriv jiste selhane, pak dle data
rows.sort(key=lambda r: (r["failed"] != "ANO", r["date"]))
n_fail = sum(1 for r in rows if r["failed"] == "ANO")
n_sendas = sum(1 for r in rows if "SENDAS_BUZ" in r["flags"])
n_failbody = sum(1 for r in rows if "FAIL_BODY" in r["flags"])
n_nomid = sum(1 for r in rows if "NO_MSGID" in r["flags"])
print(f"Souboru: {len(rows)}")
print(f" jiste selhane (FAIL_BODY/SENDAS_BUZ): {n_fail}")
print(f" z toho SENDAS_BUZ (buzalka.cz): {n_sendas} | FAIL_BODY: {n_failbody}")
print(f" jen NO_MSGID (slabe): {sum(1 for r in rows if r['flags'] == 'NO_MSGID')}")
print("=" * 110)
print(f"{'datum':10} {'prijemce':32} {'fail':4} {'flags':22} subjekt")
print("-" * 110)
for r in rows:
print(f"{r['date']:10} {r['recipient'][:32]:32} {r['failed']:4} {r['flags']:22} {r['subject'][:40]}")
# xlsx
wb = openpyxl.Workbook()
ws = wb.active
ws.title = "suspects"
cols = ["file", "date", "recipient", "subject", "msgid", "send_account", "sentrep", "flags", "failed"]
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE
def clean(v):
return ILLEGAL_CHARACTERS_RE.sub("", str(v)) if v is not None else ""
ws.append(cols)
for r in rows:
ws.append([clean(r[c]) for c in cols])
stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
out = os.path.join(OUT_DIR, f"sent_suspects_analyza_{stamp}.xlsx")
wb.save(out)
print("\nXLSX:", out)
if __name__ == "__main__":
main()
@@ -0,0 +1,51 @@
# doplnujici_dotazy_v1.0: tracking follow-up queries to sites
**Version:** 1.0 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)
## Purpose
When an answer is missing from the SIPIQ and the questionnaire can NO longer be reopened, we ask the site separately.
The collection `feasibility.doplnujici_dotazy` records **which site and which question** a query
belongs to and what state it is in. It is related to `sipiq_responses` / `sipiq_questions` (see sipiq_import).
## Model (agreed 17JUN2026)
- **1 document = one query EVENT** (it can carry several questions in `questions[]`).
- When the site answers, the answer is **projected into `sipiq_responses.answers_supplement{}`**
(`{value, source:"doplneno", doplnujici_dotaz_id, answered_at, answer_source}`); the original
Qualtrics `answers` are **NOT modified**. Analysis/reconstruction can then overlay `answers_supplement` on top of `answers`.
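The overlay step can be sketched as follows. This is a minimal illustration under one stated assumption: a supplemented value takes precedence when the same key exists in both maps, which the model above does not spell out. The helper name `effective_answers` is hypothetical.

```python
def effective_answers(response):
    """Overlay answers_supplement on answers without mutating the stored document."""
    merged = dict(response.get("answers") or {})
    for qcode, supplement in (response.get("answers_supplement") or {}).items():
        # Each supplement entry is a sub-document; only its `value` is overlaid.
        merged[qcode] = supplement.get("value")
    return merged
```

Because the function returns a new dictionary, the original Qualtrics `answers` stay untouched, in line with the rule that they are never modified.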
## Document structure
```jsonc
{
"_id": ObjectId,
"response_id": "R_…", // ref sipiq_responses._id
"investigator_oid": ObjectId, // ref investigators
"pi_last_name","site_name","site_country","pi_email", // denormalized
"status": "open", // open → asked → answered → closed / no_response
"asked_at": null, "asked_via": null, "reason": "…", "note": null,
"questions": [
{"qcode":"Q72_1","question_base":"Q72","question_text":"…","section":"…",
"answer":null,"answered_at":null,"answer_source":null,"status":"open"}
],
"created_at":"…","updated_at":"…","history":[]
}
```
Indexes: `investigator_oid`, `response_id`, `status`, `questions.qcode`, `questions.status`.
## Commands
```
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py ensure
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py add --center <email|prijmeni|R_id> [--country CZ|SK] \
--qcodes Q72_1,Q73_1 [--reason "…"] [--asked-via "…"] [--status asked] [--note "…"] [--apply]
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py answer --id <dotaz_id> --qcode Q72_1 \
--answer "8" [--source "email 18JUN2026"] [--apply]
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py list [--center …] [--open]
```
- `add`/`answer` default to **dry-run**; pass `--apply` for a real run.
- `add` looks up the site in `sipiq_responses` (R_id / pi_email / last name + country) and the question text + section
in `sipiq_questions` (the qcode can be a leaf, e.g. Q72_1 → text of base Q72 + item label).
- `answer` writes the answer to the question, recomputes the event status (answered only once all questions are answered),
and projects it into `sipiq_responses.answers_supplement`.
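The status recomputation performed by `answer` can be sketched as a pure function. This is an illustration only: `event_status` is a hypothetical name, and the event is assumed to fall back to `asked` while any question is still unanswered.

```python
def event_status(questions):
    """Return 'answered' only when every question in the event is answered."""
    if all(q.get("status") == "answered" for q in questions):
        return "answered"
    return "asked"
```

Note that `all()` over an empty list is true, so an event with no questions would count as answered; a caller may want to guard against that case.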
## Status as of 17JUN2026
Collection + indexes created (`ensure`), 0 documents so far. Dry-run `add` verified (Svoboda, Q72_1+Q73_1).
Mongo 192.168.1.76:27017, no auth, pymongo.
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
doplnujici_dotazy_v1.0.py
=========================
Version: 1.0
Date: 2026-06-17
Author: Claude Code (for MUDr. Vladimír Buzalka)
Description
-----------
Manages the `feasibility.doplnujici_dotazy` collection, which tracks follow-up queries to sites
when an answer is missing from the SIPIQ and the questionnaire can NO longer be reopened. It records
which site (and which question) a query belongs to and what state it is in.
Model (agreed 17JUN2026): **1 document = one query EVENT** (it can carry several questions in `questions[]`).
When the site answers, the answer is also PROJECTED into `sipiq_responses.answers_supplement{}`
(flagged source="doplneno"); the original Qualtrics `answers` are NOT modified.
Query life cycle: open → asked → answered → closed / no_response.
Commands
--------
ensure
Creates the collection + indexes (idempotent).
add --center <email|prijmeni|R_id> [--country CZ|SK] --qcodes Q72_1,Q73_1
[--reason ""] [--asked-via ""] [--status asked] [--note ""] [--apply]
Creates a new query event. The site and questions are looked up in sipiq_responses
/ sipiq_questions; identity fields are denormalized. Dry-run by default.
answer --id <dotaz_id> --qcode Q72_1 --answer "8" [--source "email 18JUN2026"] [--apply]
Writes the answer to one question of the event, projects it into sipiq_responses.answers_supplement,
and recomputes the event status. Dry-run by default.
list [--center <email|prijmeni>] [--open]
Lists queries (optionally only open ones / for a single site).
Mongo 192.168.1.76:27017, no auth, pymongo.
"""
import argparse
import re
import sys
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING
from bson import ObjectId
MONGO_URI = "mongodb://192.168.1.76:27017"
DB = "feasibility"
COL = "doplnujici_dotazy"
COL_R = "sipiq_responses"
COL_Q = "sipiq_questions"
OPEN_STATES = ("open", "asked")
def now_iso():
return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")
def qbase(qcode):
m = re.match(r"(Q\d+)", qcode)
return m.group(1) if m else qcode
def db_conn():
c = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
c.admin.command("ping")
return c, c[DB]
def ensure(db):
db[COL].create_index([("investigator_oid", ASCENDING)])
db[COL].create_index([("response_id", ASCENDING)])
db[COL].create_index([("status", ASCENDING)])
db[COL].create_index([("questions.qcode", ASCENDING)])
db[COL].create_index([("questions.status", ASCENDING)])
print(f"OK: kolekce '{COL}' + indexy připraveny. Dokumentů: {db[COL].count_documents({})}")
def find_center(db, key, country=None):
"""Najde sipiq_responses dle ResponseId / pi_email / příjmení."""
if key.startswith("R_"):
d = db[COL_R].find_one({"_id": key})
if d:
return d
d = db[COL_R].find_one({"pi_email": key.lower()})
if d:
return d
flt = {"pi_last_name": re.compile(f"^{re.escape(key)}$", re.I)}
if country:
flt["site_country"] = {"CZ": "Czech Republic", "SK": "Slovakia"}.get(country, country)
cands = list(db[COL_R].find(flt))
if len(cands) == 1:
return cands[0]
if len(cands) > 1:
raise SystemExit(f"CHYBA: '{key}' je nejednoznačné ({len(cands)} center). Upřesni e-mailem nebo --country / R_id.")
raise SystemExit(f"CHYBA: centrum '{key}' nenalezeno v {COL_R}.")
def question_meta(db, qcode):
    """Question text + section from sipiq_questions (qcode may be a leaf, e.g. Q72_1)."""
    base = qbase(qcode)
    q = db[COL_Q].find_one({"_id": base})
    if not q:
        return {"question_base": base, "question_text": None, "section": None}
    text = q.get("text")
    label = None
    for it in q.get("items", []):
        # Match the raw Qcode as well; `key` is the sanitized form and can differ from it.
        if qcode in (it.get("qcode"), it.get("key")):
            label = it.get("label")
            break
    # Separate the question stem from the item label so the two do not run together.
    full = f"{text} / {label}" if text and label else (text or label)
    return {"question_base": base, "question_text": full, "section": q.get("section")}
def cmd_add(db, args, dry):
center = find_center(db, args.center, args.country)
qcodes = [q.strip() for q in args.qcodes.split(",") if q.strip()]
questions = []
for qc in qcodes:
meta = question_meta(db, qc)
questions.append({
"qcode": qc, "question_base": meta["question_base"],
"question_text": meta["question_text"], "section": meta["section"],
"answer": None, "answered_at": None, "answer_source": None, "status": "open",
})
ts = now_iso()
doc = {
"response_id": center["_id"],
"investigator_oid": center.get("investigator_oid"),
"pi_last_name": center.get("pi_last_name"),
"site_name": center.get("site_name"),
"site_country": center.get("site_country"),
"pi_email": center.get("pi_email"),
"status": args.status,
"asked_at": ts if args.status == "asked" else None,
"asked_via": args.asked_via,
"reason": args.reason or "neodpovězeno v SIPIQ; dotazník už uzavřen",
"note": args.note,
"questions": questions,
"created_at": ts, "updated_at": ts, "history": [],
}
print(f"Centrum: {doc['pi_last_name']} / {doc['site_name']} ({doc['site_country']}) resp={doc['response_id']}")
for q in questions:
print(f"{q['qcode']:10} [{q['section']}] {q['question_text']}")
if dry:
print("[DRY-RUN] Nezaloženo. Ostrý: --apply")
return
res = db[COL].insert_one(doc)
print(f"[APPLY] Založen dotaz _id={res.inserted_id}")
def cmd_answer(db, args, dry):
doc = db[COL].find_one({"_id": ObjectId(args.id)})
if not doc:
raise SystemExit(f"CHYBA: dotaz _id={args.id} nenalezen.")
qs = doc["questions"]
target = next((q for q in qs if q["qcode"] == args.qcode), None)
if not target:
raise SystemExit(f"CHYBA: otázka {args.qcode} není v tomto dotazu (má: {[q['qcode'] for q in qs]}).")
ts = now_iso()
print(f"Centrum: {doc['pi_last_name']} / {doc['site_name']} resp={doc['response_id']}")
print(f" {args.qcode}: {target.get('answer')!r} -> {args.answer!r} (zdroj: {args.source})")
print(f" + promítnutí do {COL_R}.answers_supplement.{args.qcode}")
if dry:
print("[DRY-RUN] Nezapsáno. Ostrý: --apply")
return
# 1) update otázky v události
for q in qs:
if q["qcode"] == args.qcode:
q["answer"] = args.answer
q["answered_at"] = ts
q["answer_source"] = args.source
q["status"] = "answered"
all_answered = all(q["status"] == "answered" for q in qs)
new_status = "answered" if all_answered else "asked"
db[COL].update_one({"_id": doc["_id"]}, {
"$set": {"questions": qs, "status": new_status, "updated_at": ts},
"$push": {"history": {"changed_at": ts, "action": "answer",
"qcode": args.qcode, "answer": args.answer, "source": args.source}},
})
# 2) promítnout do sipiq_responses.answers_supplement (původní answers NEMĚNÍM)
db[COL_R].update_one({"_id": doc["response_id"]}, {
"$set": {f"answers_supplement.{args.qcode}": {
"value": args.answer, "source": "doplneno",
"doplnujici_dotaz_id": doc["_id"], "answered_at": ts, "answer_source": args.source,
}}
})
print(f"[APPLY] Odpověď zapsána; stav události = {new_status}; promítnuto do {COL_R}.")
def cmd_list(db, args):
flt = {}
if args.open:
flt["status"] = {"$in": list(OPEN_STATES)}
if args.center:
key = args.center
if key.startswith("R_"):
flt["response_id"] = key
elif "@" in key:
flt["pi_email"] = key.lower()
else:
flt["pi_last_name"] = re.compile(f"^{re.escape(key)}$", re.I)
docs = list(db[COL].find(flt).sort("created_at", -1))
print(f"Dotazů: {len(docs)}")
for d in docs:
print(f"\n[{d['_id']}] {d['pi_last_name']} / {d['site_name']} ({d.get('site_country')}) — {d['status']}")
for q in d["questions"]:
a = q.get("answer")
print(f" {q['qcode']:10} {q['status']:9} {('= '+str(a)) if a else '(čeká)'} | {q.get('question_text')}")
def main():
ap = argparse.ArgumentParser()
sub = ap.add_subparsers(dest="cmd", required=True)
sub.add_parser("ensure")
pa = sub.add_parser("add")
pa.add_argument("--center", required=True)
pa.add_argument("--country")
pa.add_argument("--qcodes", required=True)
pa.add_argument("--reason")
pa.add_argument("--asked-via", dest="asked_via")
pa.add_argument("--status", default="open", choices=["open", "asked"])
pa.add_argument("--note")
pa.add_argument("--apply", action="store_true")
pn = sub.add_parser("answer")
pn.add_argument("--id", required=True)
pn.add_argument("--qcode", required=True)
pn.add_argument("--answer", required=True)
pn.add_argument("--source")
pn.add_argument("--apply", action="store_true")
pl = sub.add_parser("list")
pl.add_argument("--center")
pl.add_argument("--open", action="store_true")
args = ap.parse_args()
client, db = db_conn()
try:
if args.cmd == "ensure":
ensure(db)
elif args.cmd == "add":
cmd_add(db, args, dry=not args.apply)
elif args.cmd == "answer":
cmd_answer(db, args, dry=not args.apply)
elif args.cmd == "list":
cmd_list(db, args)
finally:
client.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,48 @@
# jnj_dump_recipient_msgs_v1.0.py
**Version:** 1.0 · **Date:** 2026-06-16
JNJ-native (pywin32 / MAPI). Finds **all e-mails to a given recipient** (default
Hušták) across the selected folders, **saves them as `.msg`**, and for each one **prints
diagnostic MAPI properties read from the live item**. Purpose: to verify whether the
properties (GAL name, ReportText, send-account, Message-ID…) survive in
the saved `.msg` (compared with olefile at home).
The script **sends and deletes nothing**; it only reads and saves `.msg` copies.
## Running (JNJ machine with Outlook)
```
pip install pywin32
python jnj_dump_recipient_msgs_v1.0.py
```
## What it prints for each e-mail (from the LIVE item)
- folder, role (To/Cc), `item.Sent`, `PR_MESSAGE_FLAGS` (0x0E07)
- subject, sent time
- **Msg-ID** `0x1035`
- **SenderName** `0x0C1A` + addrtype `0x0C1E`
- **SentRepresentingName** `0x0042` + addrtype `0x0064`
- **PrimarySendAccount** `0x0E28` (reveals sending "as buzalka.cz")
- **ReportText** `0x1001` (NDR "could not be sent…" = failure)
…and then saves the item as `.msg` into `OUTPUT_DIR`.
## Configuration
- `TARGET_EMAIL`: who to look for (default `rastislav.hustak@fntt.sk`).
- `SCAN_FOLDERS`: folder names (subfolders included); default Sent Items, Drafts,
Deleted Items, Archive, Inbox. `SCAN_ALL=True` = the whole mailbox (slow).
- `OUTPUT_DIR`: where to save the `.msg` files (default `C:\Users\vbuzalka\hustak_dump`).
- `SENDER_SMTP`: the account whose store is searched.
## After running
1. Compare the printout (live properties): you will see which e-mail has the GAL name /
ReportText / send-account buzalka.cz.
2. Transfer the `.msg` files from `OUTPUT_DIR` home (by any means, e.g. via the msgreceiver
upload or manually) and use olefile to check whether the saved `.msg` carries
the same properties as the live item.
## Notes
- The recipient is matched via `PR_SMTP_ADDRESS` (0x39FE), which is reliable even for internal
Exchange recipients.
- `olMSG = 3` (SaveAs type). File name = index + folder + subject + tail of the
EntryID (for pairing).
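The file-naming scheme from the note above can be sketched like this. It is an illustration only: `msg_filename` is a hypothetical helper, and the length limits (18 characters for the folder, 28 for the subject, the last 20 characters of the EntryID) are assumptions chosen for the example.

```python
import re


def safe(text, limit=40):
    """Replace anything outside A-Z, a-z, 0-9, '.', '_', '-' and cap the length."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", text or "")
    return cleaned[:limit].strip("_")


def msg_filename(index, folder, subject, entry_id):
    """Build '<index>_<folder>_<subject>_<EntryID tail>.msg' for a saved item."""
    tail = (entry_id or "")[-20:]  # the EntryID tail lets a file be paired with its live item
    return f"{index:02d}_{safe(folder, 18)}_{safe(subject, 28)}_{tail}.msg"
```

Sanitizing the folder and subject keeps the name valid on Windows file systems, while the EntryID tail keeps two messages with the same subject from overwriting each other.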
@@ -0,0 +1,188 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name: jnj_dump_recipient_msgs_v1.0.py
# Version: 1.0
# Date: 2026-06-16
# Purpose: JNJ-native (MAPI / pywin32). Finds ALL e-mails to a given recipient
# (default Hustak) across the selected folders, SAVES them as .msg and
# for each one PRINTS diagnostic MAPI properties read from the LIVE
# item (Message-ID 0x1035, SenderName 0x0C1A, SentRepresentingName
# 0x0042, addrtype 0x0C1E/0x0064, ReportText 0x1001, PrimarySendAccount
# 0x0E28, MessageFlags 0x0E07, item.Sent). Goal: compare whether these
# properties survive in the saved .msg (olefile check at home).
# Usage: Run in the JNJ Python (Thonny), Outlook with the JNJ mailbox.
# pip install pywin32 ; python jnj_dump_recipient_msgs_v1.0.py
# The script sends and deletes NOTHING; it only READS and saves .msg copies.
# =============================================================================
import os
import re
import sys
import win32com.client # pywin32
# ----------------------------- CONFIGURATION ---------------------------------
SENDER_SMTP = "vbuzalka@its.jnj.com" # account (its store is searched)
TARGET_EMAIL = "rastislav.hustak@fntt.sk" # who we are looking for (To OR Cc)
# Folders to scan (matched by folder NAME anywhere in the tree; subfolders included).
# Empty list + SCAN_ALL=True => walks the whole mailbox (slow!).
SCAN_FOLDERS = ["Sent Items", "Drafts", "Deleted Items", "Archive", "Inbox"]
SCAN_ALL = False
# Where to save the .msg copies (on the JNJ machine). Created if it does not exist.
OUTPUT_DIR = r"C:\Users\vbuzalka\hustak_dump"
# -----------------------------------------------------------------------------
OL_MSG = 3 # olMSG (SaveAs type)
OL_FOLDER_SENT = 5
PA = "http://schemas.microsoft.com/mapi/proptag/0x{:s}"
# Diagnostic tags (PT_UNICODE 001F, PT_LONG 0003)
TAGS = [
("Msg-ID", "1035001F"),
("SenderName", "0C1A001F"),
("SenderAddrType", "0C1E001F"),
("SentRepName", "0042001F"),
("SentRepAddrType", "0064001F"),
("ReportText", "1001001F"),
("PrimarySendAcct", "0E28001F"),
]
TAG_MSGFLAGS = "0E070003"
TAG_RCPT_ADDRTYPE = "3002001F"
def smtp_of(recipient):
try:
return (recipient.PropertyAccessor.GetProperty(PA.format("39FE001E")) or "").lower()
except Exception:
try:
return (recipient.Address or "").lower()
except Exception:
return ""
def get_prop(item, tag):
try:
v = item.PropertyAccessor.GetProperty(PA.format(tag))
return v
except Exception:
return None
def get_store_root(ns):
try:
for acct in ns.Accounts:
if (acct.SmtpAddress or "").lower() == SENDER_SMTP.lower():
return acct.DeliveryStore.GetRootFolder()
except Exception:
pass
return ns.GetDefaultFolder(OL_FOLDER_SENT).Parent # fallback: koren default store
def iter_target_folders(root):
"""Yield slozek, ktere se maji skenovat (dle nazvu + jejich podslozky)."""
def walk(folder, inscope):
scope = inscope or SCAN_ALL or (folder.Name in SCAN_FOLDERS)
if scope:
yield folder
try:
for sub in folder.Folders:
yield from walk(sub, scope)
except Exception:
pass
yield from walk(root, False)
def safe(s, n=40):
s = re.sub(r"[^A-Za-z0-9._-]+", "_", (s or ""))
return s[:n].strip("_")
def matches_target(item):
"""Vrati ('To'/'Cc') kdyz je TARGET_EMAIL mezi prijemci, jinak None."""
tgt = TARGET_EMAIL.lower()
try:
for r in item.Recipients:
if smtp_of(r) == tgt:
return {1: "To", 2: "Cc", 3: "Bcc"}.get(r.Type, "To")
except Exception:
pass
return None
def main():
os.makedirs(OUTPUT_DIR, exist_ok=True)
outlook = win32com.client.Dispatch("Outlook.Application")
ns = outlook.GetNamespace("MAPI")
root = get_store_root(ns)
print(f"Hledam e-maily, kde je prijemce: {TARGET_EMAIL}")
print(f"Slozky: {'VSE' if SCAN_ALL else ', '.join(SCAN_FOLDERS)}")
print(f"Vystup .msg: {OUTPUT_DIR}")
print("=" * 90)
idx = 0
for folder in iter_target_folders(root):
try:
items = folder.Items
except Exception:
continue
for it in list(items):
try:
if it.Class != 43: # olMail
continue
except Exception:
continue
role = matches_target(it)
if not role:
continue
idx += 1
# --- diagnostika ze ZIVE polozky ---
try:
sent_flag = it.Sent
except Exception:
sent_flag = "?"
flags = get_prop(it, TAG_MSGFLAGS)
props = {label: get_prop(it, tag) for label, tag in TAGS}
try:
sent_on = it.SentOn
except Exception:
sent_on = None
try:
entry_tail = (it.EntryID or "")[-20:]
except Exception:
entry_tail = ""
print(f"\n[{idx}] slozka='{folder.Name}' role={role} Sent={sent_flag} flags={flags}")
print(f" subject : {getattr(it,'Subject','')}")
print(f" sent_on : {sent_on}")
print(f" Msg-ID : {props['Msg-ID']}")
print(f" SenderName : {props['SenderName']} (addrtype {props['SenderAddrType']})")
print(f" SentRepName : {props['SentRepName']} (addrtype {props['SentRepAddrType']})")
print(f" PrimarySendAcct: {props['PrimarySendAcct']}")
rt = props["ReportText"]
print(f" ReportText 0x1001: {'ANO -> ' + repr(rt[:120]) if rt else '-'}")
# --- ulozeni .msg ---
fn = f"{idx:02d}_{safe(folder.Name,18)}_{safe(getattr(it,'Subject',''),28)}_{entry_tail}.msg"
path = os.path.join(OUTPUT_DIR, fn)
try:
it.SaveAs(path, OL_MSG)
print(f" ulozeno: {fn}")
except Exception as e:
print(f" !! SaveAs chyba: {e}")
print("\n" + "=" * 90)
print(f"Hotovo. Nalezeno a ulozeno: {idx} polozek do {OUTPUT_DIR}")
print("Prines .msg domu a porovnej vlastnosti olefilem (zive vs ulozene).")
if __name__ == "__main__":
try:
main()
except Exception as e:
print("CHYBA:", e)
sys.exit(1)
+45
@@ -0,0 +1,45 @@
# jnj_scan_failed_sent_v1.0.py
**Version:** 1.0 · **Date:** 2026-06-16
JNJ-native (pywin32 / MAPI). Scans **Sent Items for the last N days** (default 60),
finds **suspicious, i.e. probably unsent,** e-mails, saves them as `.msg`,
and prints which indicators matched. **It never sends or deletes anything.**
## Indicators (read from the LIVE item)
- **FAIL_BODY** (strong): the body / ReportText contains "could not be sent",
  "SendAsDenied", "permission to send the message on behalf",
  "TransportSend operation has failed", or "MapiExceptionSendAsDenied".
- **SENDAS_BUZ** (strong): `PrimarySendAccount` (0x0E28) / SentRepresenting (0x0065)
  / Sender (0x0C1F) contains `buzalka.cz`, meaning the message went out under the wrong identity.
- **NO_MSGID** (weak): the Internet Message-ID (0x1035) is missing; this can also be a
  provisional copy that is completed later.
`STRONG_*` files = a strong indicator matched (almost certainly not sent).
`weak_*` files = NO_MSGID only.
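
The indicator rules above can be sketched as a pure function over plain strings, which makes them easy to check without Outlook. This is a minimal sketch, not the script itself: `classify` and its parameters are hypothetical names, and the phrase list is copied from the README (the script may carry additional, localized phrases).

```python
# Phrases taken from the FAIL_BODY bullet above, lowercased for matching.
FAIL_SIGNS = (
    "could not be sent",
    "sendasdenied",
    "permission to send the message on behalf",
    "transportsend operation has failed",
    "mapiexceptionsendasdenied",
)


def classify(body, report_text, sender_fields, message_id):
    """Return (flags, prefix) for one Sent Items entry given as plain strings.

    prefix is "STRONG", "weak", or None when nothing matched.
    """
    flags = []
    blob = f"{body or ''}\n{report_text or ''}".lower()
    if any(sign in blob for sign in FAIL_SIGNS):
        flags.append("FAIL_BODY")
    if any("buzalka.cz" in (value or "").lower() for value in sender_fields):
        flags.append("SENDAS_BUZ")
    if not message_id:
        flags.append("NO_MSGID")
    strong = "FAIL_BODY" in flags or "SENDAS_BUZ" in flags
    prefix = "STRONG" if strong else ("weak" if flags else None)
    return flags, prefix
```

Keeping the rules separate from the MAPI access means they can be re-run later over properties read from saved `.msg` files.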
## Running (JNJ machine with Outlook)
```
pip install pywin32
python jnj_scan_failed_sent_v1.0.py
```
## Configuration
- `DAYS` = scan window in days (default 60).
- `OUTPUT_DIR` = where the `.msg` files are saved (default `C:\Users\vbuzalka\sent_suspects`).
- `INCLUDE_NO_MSGID` = also save items whose only indicator is NO_MSGID (default True; set it to False
  if you want only the hard FAIL/SENDAS hits).
- `SENDER_SMTP` = the account whose Sent Items folder is scanned.
## Procedure
1. Run it on the JNJ machine; the output lists the suspects and the saved `.msg` files.
2. Bring the `.msg` files from `OUTPUT_DIR` home; we go through them with olefile and confirm
   which ones really were not sent (and who needs a resend with the correct From).
## Notes
- The 60-day window is there for performance (items are sorted by SentOn descending, so the scan
  stops at the first item older than the window).
- Detection works on the **live** item (a fresh SaveAs), which is why the script runs
  directly on the JNJ machine rather than over old batch copies.
- Main cause of failure: From = `vladimir.buzalka@buzalka.cz` on the account
  `vbuzalka@its.jnj.com` without SendAs permission, so Exchange rejects the message. See the memory note
  project_jnj_unsent_detection.
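
The early stop described in the notes can be illustrated on plain dates. This is a sketch under assumptions: `scan_window` is a hypothetical helper, and it takes dates already sorted newest first, the way the script sorts Sent Items by SentOn.

```python
import datetime


def scan_window(sent_dates, today, days=60):
    """Return the leading entries of a newest-first list that fall inside the window.

    Entries without a date (None) are kept, because they cannot be proven too old.
    The loop stops at the first dated entry older than the cutoff.
    """
    cutoff = today - datetime.timedelta(days=days)
    kept = []
    for sent in sent_dates:
        if sent is not None and sent < cutoff:
            break  # everything after this point is older still
        kept.append(sent)
    return kept
```

The break is only safe because of the descending sort; on an unsorted collection the same loop would silently skip newer items that come later.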
+191
@@ -0,0 +1,191 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name:        jnj_scan_failed_sent_v1.0.py
# Version:     1.0
# Date:        2026-06-16
# Description: JNJ-native (MAPI / pywin32). Scans the Sent Items folder for the
#              last N days and finds SUSPICIOUS e-mails, i.e. probably NOT SENT
#              (e.g. SendAs denied). Each suspect is SAVED as .msg and the matching
#              indicators are printed. It never sends or deletes; it only READS and saves.
# Suspicion indicators (read from the LIVE item):
#   FAIL_BODY  = body/ReportText contains "could not be sent" / "SendAsDenied"
#                / "permission to send the message on behalf" / "TransportSend"
#   SENDAS_BUZ = PrimarySendAccount/SentRepresenting/Sender contains "buzalka.cz"
#   NO_MSGID   = Internet Message-ID (0x1035) is missing -- a weaker indicator
# Usage:       JNJ Python (Thonny), Outlook with the JNJ mailbox.
#              pip install pywin32 ; python jnj_scan_failed_sent_v1.0.py
# =============================================================================
import os
import re
import sys
import datetime
import win32com.client # pywin32
# ----------------------------- KONFIGURACE -----------------------------------
SENDER_SMTP = "vbuzalka@its.jnj.com"
DAYS = 60 # okno: poslednich N dni
OUTPUT_DIR = r"C:\Users\vbuzalka\sent_suspects"
# Also save items whose ONLY indicator is the weak NO_MSGID (no FAIL/SENDAS)?
# True = include provisional copies without a Message-ID (may produce more files).
INCLUDE_NO_MSGID = True
# -----------------------------------------------------------------------------
OL_MSG = 3
OL_FOLDER_SENT = 5
PA = "http://schemas.microsoft.com/mapi/proptag/0x{:s}"
P_MSGID = "1035001F"
P_SENDACCT = "0E28001F" # PrimarySendAccount
P_SENTREP_EM = "0065001F" # SentRepresentingEmailAddress
P_SENDER_EM = "0C1F001F" # SenderEmailAddress
P_REPORTTEXT = "1001001F" # ReportText (kdyz existuje)
FAIL_SIGNS = [
"could not be sent",
"sendasdenied",
"permission to send the message on behalf",
"transportsend operation has failed",
"mapiexceptionsendasdenied",
"tuto zpravu nelze odeslat", # pro pripad lokalizace
]
def gp(item, tag):
try:
return item.PropertyAccessor.GetProperty(PA.format(tag))
except Exception:
return None
def get_sent_folder(ns):
try:
for acct in ns.Accounts:
if (acct.SmtpAddress or "").lower() == SENDER_SMTP.lower():
return acct.DeliveryStore.GetDefaultFolder(OL_FOLDER_SENT)
except Exception:
pass
return ns.GetDefaultFolder(OL_FOLDER_SENT)
def safe(s, n=34):
return re.sub(r"[^A-Za-z0-9._-]+", "_", (s or ""))[:n].strip("_")
def analyze(item):
    """Return (flags, message_id): the matched indicators and the Message-ID ('' if missing)."""
flags = []
# 1) FAIL_BODY: telo + ReportText
blob = ""
try:
blob += (item.Body or "")
except Exception:
pass
rt = gp(item, P_REPORTTEXT)
if rt:
blob += "\n" + str(rt)
low = blob.lower()
if any(s in low for s in FAIL_SIGNS):
flags.append("FAIL_BODY")
# 2) SENDAS_BUZ: nektera z odesilatelskych poloz. obsahuje buzalka.cz
for tag in (P_SENDACCT, P_SENTREP_EM, P_SENDER_EM):
v = gp(item, tag)
if v and "buzalka.cz" in str(v).lower():
flags.append("SENDAS_BUZ")
break
# 3) NO_MSGID
mid = gp(item, P_MSGID)
if not mid:
flags.append("NO_MSGID")
return flags, (mid or "")
def main():
os.makedirs(OUTPUT_DIR, exist_ok=True)
cutoff = datetime.date.today() - datetime.timedelta(days=DAYS)
outlook = win32com.client.Dispatch("Outlook.Application")
ns = outlook.GetNamespace("MAPI")
sent = get_sent_folder(ns)
items = sent.Items
items.Sort("[SentOn]", True) # nejnovejsi prvni
print(f"Slozka : {sent.FolderPath}")
print(f"Okno : poslednich {DAYS} dni (od {cutoff.isoformat()})")
print(f"Vystup : {OUTPUT_DIR}")
print(f"NO_MSGID se uklada: {INCLUDE_NO_MSGID}")
print("=" * 90)
scanned = saved = strong = 0
for it in list(items):
try:
if it.Class != 43:
continue
except Exception:
continue
# datum + early stop
try:
s = it.SentOn
sdate = datetime.date(s.year, s.month, s.day)
except Exception:
sdate = None
if sdate is not None:
if sdate < cutoff:
break # dale uz jen starsi (serazeno desc)
scanned += 1
flags, mid = analyze(it)
if not flags:
continue
is_strong = ("FAIL_BODY" in flags) or ("SENDAS_BUZ" in flags)
if not is_strong and not (INCLUDE_NO_MSGID and "NO_MSGID" in flags):
continue
saved += 1
if is_strong:
strong += 1
subj = ""
try:
subj = it.Subject or ""
except Exception:
pass
try:
tail = (it.EntryID or "")[-20:]
except Exception:
tail = ""
tagstr = "+".join(flags)
print(f"\n[{saved}] {sdate} flags={tagstr}")
print(f" subj : {subj}")
print(f" msgid: {mid if mid else '<chybi>'}")
fn = f"{('STRONG' if is_strong else 'weak')}_{sdate}_{safe(subj,30)}_{tail}.msg"
path = os.path.join(OUTPUT_DIR, fn)
try:
it.SaveAs(path, OL_MSG)
print(f" ulozeno: {fn}")
except Exception as e:
print(f" !! SaveAs chyba: {e}")
print("\n" + "=" * 90)
print(f"Prohledano (v okne): {scanned}")
print(f"Ulozeno podezrelych: {saved} (z toho silnych FAIL/SENDAS: {strong})")
print(f"Soubory v: {OUTPUT_DIR} -> prines je domu ke kontrole.")
print("Pozn.: STRONG_* = telo NDR nebo send-account buzalka.cz (skoro jiste neodeslano).")
print(" weak_* = jen chybi Message-ID (muze byt i provizorni kopie, co se pozdeji dokonci).")
if __name__ == "__main__":
try:
main()
except Exception as e:
print("CHYBA:", e)
sys.exit(1)
@@ -0,0 +1,63 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name:        promote_sipiq_submitted_v1.0.py
# Version:     1.0
# Date:        2026-06-17
# Description: Moves the listed investigators (KROK 6 - SIPIQ sent) to
#              KROK "7 - SIPIQ vyplneny" (SIPIQ completed) based on the Illuminator export
#              (status "SIPIQ Submitted"). Illuminator is the authoritative source because
#              a physician may not announce SIPIQ completion by e-mail. Prepends a line to STATUS.
# Usage:       python promote_sipiq_submitted_v1.0.py            (dry-run)
#              python promote_sipiq_submitted_v1.0.py --apply
# =============================================================================
import sys
from pymongo import MongoClient
from bson import ObjectId
MONGO_URI = "mongodb://192.168.1.76:27017"
LINE = ("17JUN2026: SIPIQ VYPLNENY — dle Illuminator exportu (status „SIPIQ "
"Submitted“); lekar vyplneni neoznamil, Illuminator = ultimatni zdroj. KROK 7.")
# 13 investigators with SIPIQ Submitted in Illuminator that are still at KROK 6 in Mongo
IDS = [
("6a19832b5fc2213518257969", "Durina Juraj"),
("6a19832b5fc221351825796e", "Falc Matej"),
("6a19832b5fc2213518257954", "Fedurco Miroslav"),
("6a19832b5fc221351825796c", "Gregar Jan"),
("6a19832b5fc221351825794f", "Hlavaty Tibor"),
("6a19832b5fc2213518257973", "Horvath Frantisek"),
("6a19832b5fc221351825796f", "Konecny Michal"),
("6a19832b5fc2213518257972", "Konecny Stefan"),
("6a1c4275aa46d8b608065cec", "Lukac Ludovit"),
("6a19832b5fc2213518257958", "Mihalkanin Lubomir"),
("6a198b661218c31ab0f5ba41", "Pesta Martin"),
("6a19832b5fc221351825795e", "Stepek David"),
("6a198b661218c31ab0f5ba43", "Tichy Michal"),
]
def main():
apply = "--apply" in sys.argv
col = MongoClient(MONGO_URI)["feasibility"]["investigators"]
n = 0
for hid, label in IDS:
oid = ObjectId(hid)
d = col.find_one({"_id": oid}, {"STATUS": 1, "KROK": 1})
if not d:
print(f" !! {label}: NENALEZEN"); continue
        krok = str(d.get("KROK") or "")  # tolerate a missing, None or non-string KROK
if not krok.startswith("6"):
print(f" ~~ {label}: KROK={krok} (neni 6) -> preskakuji"); continue
        print(f" [{label}] KROK {krok} -> 7 - SIPIQ vyplneny")
        n += 1  # count every investigator that qualifies, in dry-run too
        if apply:
            new_status = LINE + "\n" + (d.get("STATUS", "") or "")
            col.update_one({"_id": oid}, {"$set": {
                "KROK": "7 - SIPIQ vyplneny", "STATUS": new_status}})
    # Report the real number of qualifying records instead of assuming all of IDS.
    print(f"\n{'ZAPSANO' if apply else 'DRY-RUN'}: {n}/{len(IDS)}")
if not apply:
print(">>> Pro zapis spust s --apply")
if __name__ == "__main__":
main()
+40
@@ -0,0 +1,40 @@
# sipiq_import_v1.2: import of SIPIQ responses (folder workflow + provenance)
**Version:** 1.2 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)
## Changes
- **v1.2:** every response gets `source_exported_at` = the **report date/time according to the filesystem**
  (mtime of the CSV file). It is excluded from the content hash, so it does not cause needless UPDATEs; it is
  also backfilled on the "unchanged" path. v1.1 was moved to `Feasibility\TRASH`.
- **v1.1:** FOLDER workflow (`--folder`): collects *.csv, runs a delta import, moves the files to `Zpracováno`.
## Collections
- `sipiq_questions`: the questionnaire dictionary (for reconstructing the SIPIQ as in the PDF).
- `sipiq_responses`: 1 document = 1 response (`_id` = ResponseId), flat `answers{}`,
  soft-link `investigator_oid`, `source_file` + `source_exported_at`, delta + `history[]`.
Source = Qualtrics **CSV** (row 1 Qcode, row 2 question text, row 3 ImportId = QID). Export labels,
decimal point, "recode unanswered" switched off.
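
The three-row header described above can be read roughly as follows. This is a minimal sketch assuming the CSV text is already in memory; `read_qualtrics_header` is a hypothetical name, not a function of the import script.

```python
import csv
import io
import json


def read_qualtrics_header(text):
    """Split a Qualtrics CSV into column descriptors and data rows.

    Row 1 holds the Qcode, row 2 the question text, and row 3 a JSON cell
    such as {"ImportId": "QID2"}. A cell that is not a JSON object is kept verbatim.
    """
    rows = list(csv.reader(io.StringIO(text)))
    codes, texts, ids = rows[0], rows[1], rows[2]
    cols = []
    for code, label, raw in zip(codes, texts, ids):
        try:
            qid = json.loads(raw).get("ImportId", "")
        except (ValueError, AttributeError):
            qid = raw
        cols.append({"code": code, "text": label, "qid": qid})
    return cols, rows[3:]
```

The ImportId from row 3 is what ties a CSV column back to the QID tags of the XML export.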
## Delta (rewrites ONLY changed data)
new → INSERT; unchanged (same `content_sha256`) → only `last_seen_at` + `source_file` + `source_exported_at`;
changed → `$set` of the re-imported document fields + `$push` of the field-level changes into `history[]`.
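
The delta decision can be sketched without a database: hash the content while leaving out bookkeeping and provenance fields, then compare with the stored hash. `plan_delta` and the action names are hypothetical; the excluded field list follows the fields this README names as bookkeeping.

```python
import hashlib
import json

# Fields that must not influence the content hash (bookkeeping and provenance).
VOLATILE = {
    "content_sha256", "first_imported_at", "last_seen_at", "last_updated_at",
    "history", "investigator_oid", "investigator_match",
    "source_file", "source_exported_at",
}


def content_hash(doc):
    """SHA-256 over a canonical JSON dump of the non-volatile fields."""
    payload = {k: v for k, v in doc.items() if k not in VOLATILE}
    raw = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def plan_delta(new_doc, stored_hash):
    """Return "insert", "touch" (unchanged) or "update" for one response."""
    if stored_hash is None:
        return "insert"
    if stored_hash == content_hash(new_doc):
        return "touch"
    return "update"
```

Because `source_file` and `source_exported_at` are outside the hash, re-exporting the same answers under a new file name only touches the record.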
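## Soft-link to investigators (non-destructive)
pi_email → email/email2 (lowercased), then recipient_email, then a fallback on surname (diacritics stripped) + country.
The surname fallback links only when exactly one investigator matches; ambiguous hits stay unlinked.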
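
The matching order can be sketched with two lookup dictionaries. This is an illustration under assumptions: `by_email` maps a lowercased address to an investigator, `by_name` maps `(normalized surname, country)` to a list of investigators, and the returned match labels are hypothetical.

```python
import re
import unicodedata


def norm_name(s):
    """Lowercase, strip diacritics and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", s or "")
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", no_marks.lower()).strip()


def soft_link(resp, by_email, by_name):
    """Return (investigator, how); investigator is None when nothing links safely."""
    for field in ("pi_email", "recipient_email"):
        email = (resp.get(field) or "").lower().strip()
        if email and email in by_email:
            return by_email[email], f"{field}:{email}"
    surname = norm_name(resp.get("pi_last_name"))
    candidates = by_name.get((surname, resp.get("site_country")), [])
    if len(candidates) == 1:
        return candidates[0], f"surname:{surname}"
    return None, "ambiguous" if candidates else "not_found"
```

Refusing to pick among several same-surname candidates keeps the link non-destructive: a wrong link would be harder to notice than a missing one.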
## Usage
```
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --dry-run # folder mode, default folder
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --apply
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --folder "<path>" --apply
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --csv "<path.csv>" --apply # single file, NOT moved
```
Default folder: `…\77242113UCO2001\ImportSIPIQcompled`; files are moved to `Zpracováno` only with `--apply` in folder mode.
`--scope czsk` (default) / `all`. Without `--apply` the run is a dry-run.
## Workflow
The user drops complete SIPIQ reports (Qualtrics CSV, named
`ICONIC+Phase+3b+UC+Study+(77242113UCO3002)_SipIQ_V1_13MAY2026_<date>_<time>.csv`) into
`ImportSIPIQcompled\`. After `--apply` they are imported (delta) and moved to `Zpracováno\`.
`source_exported_at` is taken from the file mtime (the report date/time according to the filesystem).
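
When a report with the same name was already processed, the move must not overwrite it. A minimal sketch of picking a collision-free target name; `unique_destination` and its `exists` parameter are hypothetical and exist mainly to make the rule testable without touching the disk.

```python
import os


def unique_destination(dest_dir, filename, exists=os.path.exists):
    """Return a path in dest_dir that does not collide with an existing file.

    A taken name gets a numeric suffix before the extension: r.csv, r_1.csv, r_2.csv, ...
    """
    candidate = os.path.join(dest_dir, filename)
    if not exists(candidate):
        return candidate
    stem, ext = os.path.splitext(filename)
    n = 1
    while exists(os.path.join(dest_dir, f"{stem}_{n}{ext}")):
        n += 1
    return os.path.join(dest_dir, f"{stem}_{n}{ext}")
```

The result can be passed straight to `shutil.move(src, unique_destination(dest_dir, name))`.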
+489
@@ -0,0 +1,489 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
sipiq_import_v1.2.py
====================
Version: 1.2
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)
Changes since v1.1
------------------
- PROVENANCE: every response stores `source_exported_at` = the report date/time
  according to the FILESYSTEM (mtime of the CSV file). It is excluded from the content hash,
  so it does not cause needless UPDATEs; it is also backfilled on the "unchanged" path.
  The old v1.1 is kept in TRASH.
Changes since v1.0
------------------
- FOLDER WORKFLOW (v1.1): the --folder mode collects *.csv in a folder, imports them (delta)
  and moves them to the `Zpracováno` subfolder. Default folder =
  U:\\PythonProject\\Janssen\\Feasibility\\77242113UCO2001\\ImportSIPIQcompled.
Description
-----------
Imports SIPIQ responses (Qualtrics CSV export, study 77242113UCO3002 / ICONIC DAWN)
into the MongoDB database `feasibility`. Two collections:
  * sipiq_questions   the questionnaire dictionary (1 document = 1 logical question).
  * sipiq_responses   1 document = 1 response (_id = Qualtrics ResponseId), flat answers{},
                      soft-link investigator_oid, delta bookkeeping + history[].
DELTA import (rewrites ONLY changed data): new -> insert; unchanged -> only last_seen_at,
source_file and source_exported_at; changed -> $set of the re-imported fields + $push into history[].
Usage
-----
    python sipiq_import_v1.2.py --dry-run                     # folder mode, default folder
    python sipiq_import_v1.2.py --apply
    python sipiq_import_v1.2.py --folder "<path>" --apply
    python sipiq_import_v1.2.py --csv "<path.csv>" --apply    # single file (NOT moved)
Dependencies: pymongo (.venv). Mongo 192.168.1.76:27017, no auth.
"""
import argparse
import csv
import glob
import hashlib
import json
import os
import re
import shutil
import sys
import unicodedata
from datetime import datetime, timezone
try:
from pymongo import MongoClient
except ImportError:
print("CHYBA: pymongo není nainstalován v aktuálním pythonu.", file=sys.stderr)
raise
MONGO_URI = "mongodb://192.168.1.76:27017"
DB_NAME = "feasibility"
COL_Q = "sipiq_questions"
COL_R = "sipiq_responses"
DEFAULT_FOLDER = r"U:\PythonProject\Janssen\Feasibility\77242113UCO2001\ImportSIPIQcompled"
PROCESSED_SUBDIR = "Zpracováno"
META_COLS = {
"StartDate", "EndDate", "Status", "IPAddress", "Progress", "Duration (in seconds)",
"Finished", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName",
"RecipientEmail", "ExternalReference", "LocationLatitude", "LocationLongitude",
"DistributionChannel", "UserLanguage",
}
PROMOTE = [
"site_name", "site_address", "site_city", "site_state", "site_postcode", "site_country",
"pi_first_name", "pi_last_name", "pi_phone", "pi_email",
"sdl_site_id", "fire_site_id", "fire_investigator_id", "mailinglist_id",
"survey_generated_by", "Date", "Time",
]
SECTION_BY_QNUM = {}
def _sec(rng, name):
for n in rng:
SECTION_BY_QNUM[n] = name
_sec([2], "J&J Internal Assessment")
_sec([6, 7, 8, 9, 10, 11, 12, 13], "Contact Information")
_sec(range(14, 22), "Confidentiality Statement")
_sec([25, 26, 27], "Interest")
_sec([29, 30, 31, 32, 33, 34], "Protocol Requirements")
_sec([36, 37, 38], "Enrollment")
_sec([40, 41, 42, 43], "Patient Demographics Overview")
_sec([45, 46, 47, 48, 49], "Site Overview")
_sec([51], "Operational Considerations")
_sec([53, 54], "Comments")
_sec([57, 58, 59, 60, 61], "Patient Population")
_sec([63, 64, 65, 66, 67], "Site Experience and Staffing")
_sec([69], "Equipment and Facility Requirements")
_sec([71, 72, 73, 74, 75], "Institutional Review Board, Ethics Committee, and Contracts")
STEM_OVERRIDE = {
"Q31": "At your site, at what line(s) of treatment do you most commonly prescribe "
"vedolizumab for patients with moderately to severely active ulcerative colitis?",
"Q63": "Do you or your site staff have experience in performing the following types of "
"study assessments/procedures?",
"Q64": "The following personnel are required to run the study. "
"Will your site have the following available?",
"Q69": "The following equipment and facilities are required to run the studies. "
"Are these available at your site?",
}
def now_iso():
return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")
def file_mtime_iso(path):
return datetime.fromtimestamp(os.path.getmtime(path)).astimezone().isoformat(timespec="seconds")
def strip_accents(s):
if not s:
return ""
return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))
def norm_name(s):
return re.sub(r"\s+", " ", strip_accents(s or "").lower()).strip()
def sanitize_key(qcode):
return qcode.replace("#", "_").replace(".", "_")
def qnum(qcode):
m = re.match(r"Q(\d+)", qcode)
return int(m.group(1)) if m else None
def qbase(qcode):
m = re.match(r"(Q\d+)", qcode)
return m.group(1) if m else qcode
def import_id(h3_cell):
try:
return json.loads(h3_cell).get("ImportId", "")
except Exception:
return h3_cell
def split_text(text):
parts = [p.strip() for p in re.split(r"\s+-\s+", text)]
stem = parts[0]
if len(parts) == 1:
return stem, None
label_parts = [p for p in parts[1:] if p.lower() != "selected choice"]
label_parts = [p for p in label_parts if not re.fullmatch(r"Q\d+#\d+", p)]
return stem, (" - ".join(label_parts) if label_parts else None)
def detect_type(qcode, observed):
has_hash = "#" in qcode
vals = [v for v in observed if v]
yesno = vals and all(v in ("Yes", "No") for v in vals)
numeric = vals and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals)
if has_hash and yesno:
return "matrix_yesno"
if has_hash and numeric:
return "matrix_percent"
if has_hash:
return "matrix"
if numeric:
return "numeric"
if yesno:
return "yesno"
return "single_or_text"
def load_csv(path):
with open(path, encoding="utf-8-sig", newline="") as fh:
rows = list(csv.reader(fh))
h1, h2, h3 = rows[0], rows[1], rows[2]
data = rows[3:]
cols = [{"i": i, "code": c, "text": t, "qid": import_id(j)}
for i, (c, t, j) in enumerate(zip(h1, h2, h3))]
return cols, data
def col_getter(cols, data):
idx = {c["code"]: c["i"] for c in cols}
def get(row, code):
i = idx.get(code)
return (row[i].strip() if i is not None and i < len(row) else "")
return get, idx
def is_question_col(code):
return bool(re.match(r"Q\d", code))
def build_questions(cols, data):
qcols = [c for c in cols if is_question_col(c["code"])]
observed = {c["code"]: set() for c in qcols}
for row in data:
for c in qcols:
v = (row[c["i"]].strip() if c["i"] < len(row) else "")
if v:
observed[c["code"]].add(v)
groups, order_seen = {}, []
for c in qcols:
base = qbase(c["code"])
if base not in groups:
groups[base] = {"_id": base, "order": c["i"], "qnum": qnum(c["code"]),
"section": SECTION_BY_QNUM.get(qnum(c["code"]), "Other"),
"qids": [], "text": split_text(c["text"])[0],
"items": [], "_obs": set(), "_types": []}
order_seen.append(base)
g = groups[base]
bq = re.match(r"(QID\d+)", c["qid"] or "")
if bq and bq.group(1) not in g["qids"]:
g["qids"].append(bq.group(1))
_, label = split_text(c["text"])
item = {"key": sanitize_key(c["code"]), "qcode": c["code"], "qid": c["qid"]}
if label:
item["label"] = label
g["items"].append(item)
g["_obs"] |= observed[c["code"]]
g["_types"].append(detect_type(c["code"], observed[c["code"]]))
out = []
for n, base in enumerate(order_seen):
g = groups[base]
obs = sorted(g.pop("_obs"))
types = g.pop("_types")
gtype = max(set(types), key=types.count) if types else "single_or_text"
g["type"] = gtype
if gtype in ("yesno", "matrix_yesno"):
g["options"] = ["Yes", "No"]
elif gtype == "single_or_text" and obs and len(obs) <= 12:
g["options"] = obs
else:
g["options"] = []
if base in STEM_OVERRIDE:
g["text"] = STEM_OVERRIDE[base]
g["order"] = n
if len(g["items"]) == 1 and "label" not in g["items"][0]:
g["items"] = []
out.append(g)
return out
def build_response(cols, get, row, source_file):
rid = get(row, "ResponseId")
answers = {}
for c in cols:
if is_question_col(c["code"]):
v = (row[c["i"]].strip() if c["i"] < len(row) else "")
if v:
answers[sanitize_key(c["code"])] = v
meta = {
"start_date": get(row, "StartDate") or None,
"end_date": get(row, "EndDate") or None,
"recorded_date": get(row, "RecordedDate") or None,
"status": get(row, "Status") or None,
"progress": int(get(row, "Progress")) if get(row, "Progress").isdigit() else (get(row, "Progress") or None),
"finished": get(row, "Finished") in ("True", "1", "TRUE"),
"duration_sec": int(get(row, "Duration (in seconds)")) if get(row, "Duration (in seconds)").isdigit() else None,
"user_language": get(row, "UserLanguage") or None,
"distribution_channel": get(row, "DistributionChannel") or None,
"ip_address": get(row, "IPAddress") or None,
"location_lat": get(row, "LocationLatitude") or None,
"location_lng": get(row, "LocationLongitude") or None,
"survey_date": get(row, "Date") or None,
"survey_time": get(row, "Time") or None,
}
doc = {
"_id": rid, "study": "77242113UCO3002",
"site_country": get(row, "site_country") or None,
"site_name": get(row, "site_name") or None,
"site_city": get(row, "site_city") or None,
"site_state": get(row, "site_state") or None,
"site_postcode": get(row, "site_postcode") or None,
"site_address": get(row, "site_address") or None,
"pi_first_name": get(row, "pi_first_name") or None,
"pi_last_name": get(row, "pi_last_name") or None,
"pi_email": (get(row, "pi_email") or "").lower() or None,
"pi_phone": get(row, "pi_phone") or None,
"sdl_site_id": get(row, "sdl_site_id") or None,
"fire_site_id": get(row, "fire_site_id") or None,
"fire_investigator_id": get(row, "fire_investigator_id") or None,
"mailinglist_id": get(row, "mailinglist_id") or None,
"survey_generated_by": get(row, "survey_generated_by") or None,
"recipient_email": (get(row, "RecipientEmail") or "").lower() or None,
"recipient_last_name": get(row, "RecipientLastName") or None,
"recipient_first_name": get(row, "RecipientFirstName") or None,
"meta": meta,
"is_full_sipiq": any(k.startswith(("Q57", "Q58", "Q59", "Q63", "Q66", "Q71")) for k in answers),
"interested": answers.get("Q25"),
"answers": answers,
"investigator_oid": None, "investigator_match": None,
"source_file": source_file,
}
return doc
def content_hash(doc):
payload = {k: doc[k] for k in doc if k not in
("content_sha256", "first_imported_at", "last_seen_at", "last_updated_at",
"history", "investigator_oid", "investigator_match", "source_file",
"source_exported_at")}
return hashlib.sha256(json.dumps(payload, sort_keys=True, ensure_ascii=False,
default=str).encode("utf-8")).hexdigest()
def load_investigators(db):
inv = list(db.investigators.find(
{"zeme": {"$in": ["Czech Republic", "Slovakia"]}},
{"prijmeni": 1, "jmeno": 1, "email": 1, "email2": 1, "zeme": 1, "KROK": 1}))
by_email, by_name = {}, {}
for d in inv:
for ef in ("email", "email2"):
e = (d.get(ef) or "").lower().strip()
if e:
by_email.setdefault(e, d)
nm = norm_name(d.get("prijmeni"))
if nm:
by_name.setdefault((nm, d.get("zeme")), []).append(d)
return inv, by_email, by_name
def soft_link(doc, by_email, by_name):
e = (doc.get("pi_email") or "").lower().strip()
if e and e in by_email:
d = by_email[e]; return d["_id"], f"email:{e}", d
e2 = (doc.get("recipient_email") or "").lower().strip()
if e2 and e2 in by_email:
d = by_email[e2]; return d["_id"], f"recipient_email:{e2}", d
nm = norm_name(doc.get("pi_last_name"))
cand = by_name.get((nm, doc.get("site_country")), [])
if len(cand) == 1:
return cand[0]["_id"], f"prijmeni:{nm}", cand[0]
if len(cand) > 1:
return None, f"prijmeni_ambiguous:{nm}({len(cand)})", None
return None, "NENALEZENO", None
def diff_docs(old, new):
changes = []
def walk(prefix, o, n):
for k in sorted(set((o or {}).keys()) | set((n or {}).keys())):
ov, nv = (o or {}).get(k), (n or {}).get(k)
if isinstance(ov, dict) or isinstance(nv, dict):
walk(f"{prefix}{k}.", ov or {}, nv or {})
elif ov != nv:
changes.append({"key": f"{prefix}{k}", "old": ov, "new": nv})
for field in ("answers", "meta"):
walk(f"{field}.", old.get(field, {}), new.get(field, {}))
for k in ("site_name", "pi_email", "pi_last_name", "interested", "is_full_sipiq"):
if old.get(k) != new.get(k):
changes.append({"key": k, "old": old.get(k), "new": new.get(k)})
return changes
# ---------------------------------------------------------------------------
def process_file(db, csv_path, scope, dry, by_email, by_name):
source_file = os.path.basename(csv_path)
exported_at = file_mtime_iso(csv_path) # datum/čas reportu dle filesystému (mtime)
cols, data = load_csv(csv_path)
get, _ = col_getter(cols, data)
if scope == "czsk":
data = [r for r in data if get(r, "site_country") in ("Czech Republic", "Slovakia")]
print(f"\n########## {source_file} (rozsah={scope}, odpovědí={len(data)}, export={exported_at}) ##########")
cols_all, data_all = load_csv(csv_path)
questions = build_questions(cols_all, data_all)
docs, link_rows = [], []
for r in data:
doc = build_response(cols, get, r, source_file)
oid, how, matched = soft_link(doc, by_email, by_name)
doc["investigator_oid"] = oid
doc["investigator_match"] = how
doc["source_exported_at"] = exported_at
doc["content_sha256"] = content_hash(doc)
docs.append(doc)
link_rows.append((doc, how, matched))
existing = {d["_id"]: d for d in db[COL_R].find({}, {"content_sha256": 1})}
to_insert = [d for d in docs if d["_id"] not in existing]
to_update = [d for d in docs if d["_id"] in existing and existing[d["_id"]].get("content_sha256") != d["content_sha256"]]
unchanged = [d for d in docs if d["_id"] in existing and existing[d["_id"]].get("content_sha256") == d["content_sha256"]]
mk7 = mko = un = 0
for doc, how, m in link_rows:
krok = (m or {}).get("KROK", "")
if m and str(krok).startswith("7"): mk7 += 1
elif m: mko += 1
else: un += 1
print(f" slovník: {len(questions)} otázek | soft-link: KROK7={mk7}, jiný={mko}, nenapárováno={un}")
print(f" delta: INSERT={len(to_insert)}, UPDATE={len(to_update)}, beze změny={len(unchanged)}")
if un:
for doc, how, m in link_rows:
if not m:
print(f" ✗ NENAPÁROVÁNO: {doc.get('pi_last_name')} / {doc.get('pi_email')} ({how})")
if dry:
print(" [DRY-RUN] nezapsáno")
return {"insert": 0, "update": 0, "unchanged": 0, "wrote": False}
for q in questions:
db[COL_Q].replace_one({"_id": q["_id"]}, q, upsert=True)
ts = now_iso()
ni = nu = ns = 0
for d in docs:
cur = db[COL_R].find_one({"_id": d["_id"]})
if cur is None:
d.update({"first_imported_at": ts, "last_seen_at": ts, "last_updated_at": ts, "history": []})
db[COL_R].insert_one(d); ni += 1
elif cur.get("content_sha256") != d["content_sha256"]:
changes = diff_docs(cur, d)
db[COL_R].update_one({"_id": d["_id"]}, {
"$set": {**{k: d[k] for k in d if k != "_id"}, "last_seen_at": ts, "last_updated_at": ts},
"$push": {"history": {"changed_at": ts, "source_file": source_file, "changes": changes}}})
nu += 1
else:
db[COL_R].update_one({"_id": d["_id"]}, {"$set": {
"last_seen_at": ts, "source_file": source_file, "source_exported_at": d["source_exported_at"]}})
ns += 1
print(f" [APPLY] questions upsert={len(questions)} | responses insert={ni}, update={nu}, beze změny={ns}")
return {"insert": ni, "update": nu, "unchanged": ns, "wrote": True}
def move_to_processed(csv_path, folder):
dest_dir = os.path.join(folder, PROCESSED_SUBDIR)
os.makedirs(dest_dir, exist_ok=True)
base = os.path.basename(csv_path)
dest = os.path.join(dest_dir, base)
if os.path.exists(dest):
stem, ext = os.path.splitext(base)
n = 1
while os.path.exists(os.path.join(dest_dir, f"{stem}_{n}{ext}")):
n += 1
dest = os.path.join(dest_dir, f"{stem}_{n}{ext}")
shutil.move(csv_path, dest)
print(f" -> přesunuto do {PROCESSED_SUBDIR}\\{os.path.basename(dest)}")
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--csv", help="jediný soubor (NEpřesouvá)")
ap.add_argument("--folder", default=DEFAULT_FOLDER, help="složka se SIPIQ CSV (přesune do Zpracováno)")
ap.add_argument("--scope", choices=["czsk", "all"], default="czsk")
ap.add_argument("--apply", action="store_true")
ap.add_argument("--dry-run", action="store_true")
args = ap.parse_args()
    dry = args.dry_run or not args.apply  # an explicit --dry-run wins over --apply
if args.csv:
files, move_mode, folder = [args.csv], False, None
else:
folder = args.folder
files = sorted(glob.glob(os.path.join(folder, "*.csv")))
move_mode = True
print(f"Složka: {folder}\nNalezeno CSV ke zpracování: {len(files)}")
if not files:
print("Nic ke zpracování (žádné *.csv).")
return
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
db = client[DB_NAME]
client.admin.command("ping")
inv, by_email, by_name = load_investigators(db)
print(f"Investigatorů CZ+SK v DB: {len(inv)}")
total = {"insert": 0, "update": 0, "unchanged": 0}
for f in files:
res = process_file(db, f, args.scope, dry, by_email, by_name)
for k in total:
total[k] += res[k]
if move_mode and res["wrote"]:
move_to_processed(f, folder)
print(f"\n=== CELKEM: insert={total['insert']}, update={total['update']}, beze změny={total['unchanged']} ===")
if dry:
print("[DRY-RUN] Nic se nezapsalo ani nepřesunulo. Ostrý běh: --apply")
client.close()
if __name__ == "__main__":
main()
+38
@@ -0,0 +1,38 @@
# store_cda_seaweed_v1.0.py
**Version:** 1.0 · **Date:** 2026-06-17
## Purpose
Stores signed CDAs (PDF) from the assistants' (CTA) e-mails in Mongo
`feasibility.investigators` under the `cda.*` field and moves the physician to
`KROK "5 - CDA podepsano"`.
Unlike `store_cda_batch` (which downloads the `.msg` over SFTP from Tower and pulls the attachment
via `extract_msg`), this version downloads the PDF **directly from SeaweedFS** via the
`seaweed_url` that the parser stores with the attachment in `emaily."vbuzalka@its.jnj.com"`
(`attachments[].seaweed_url` + `sha256`). Simpler, no SFTP.
## How it works
- `MAPPING` = explicit pairing `investigator _id → (seaweed_url, filename, sha256, size, source_msg_id)`.
- For each entry: downloads the PDF (urllib), verifies **SHA256 + size + PDF header**,
  base64-encodes it and stores it in `cda`:
  `data_base64, data_sha256, data_filename, data_mime, data_size, data_stored_at,
  data_source_msg` + metadata `stav="podepsano", soubor, zdroj`.
- Sets `KROK = "5 - CDA podepsano"` and prepends a line to `STATUS`.
- `_id` is converted to `ObjectId` (plain pymongo does not convert string→ObjectId on its own).
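
The three integrity checks and the encoding step can be sketched on raw bytes, independent of the download. This is an assumption-laden illustration: `verify_and_encode` is a hypothetical name, and only a subset of the `cda` fields listed above is produced.

```python
import base64
import hashlib


def verify_and_encode(data, expected_sha256, expected_size):
    """Check size, SHA-256 and the PDF magic bytes, then return base64 fields.

    Raises ValueError on the first failed check so nothing unverified is stored.
    """
    if len(data) != expected_size:
        raise ValueError(f"size mismatch: {len(data)} != {expected_size}")
    digest = hashlib.sha256(data).hexdigest()
    if digest != expected_sha256.lower():
        raise ValueError("SHA256 mismatch")
    if not data.startswith(b"%PDF-"):
        raise ValueError("not a PDF (missing %PDF- header)")
    return {
        "data_base64": base64.b64encode(data).decode("ascii"),
        "data_sha256": digest,
        "data_size": len(data),
        "data_mime": "application/pdf",
    }
```

In a dry run the same function can be called and its result discarded, which verifies the download without writing anything.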
## Usage
```
.venv\Scripts\python.exe Feasibility\store_cda_seaweed_v1.0.py # dry-run (verifies download + SHA, writes nothing)
.venv\Scripts\python.exe Feasibility\store_cda_seaweed_v1.0.py --apply # writes to Mongo
```
## Běh 17JUN2026 (--apply)
Uloženo 5/5 (všechny SHA256 OK), KROK 4 → 5:
Závada Filip, Bruncák Michal (FNsP B. Bystrica), Machytka Evžen (Asclepiades),
Pumprla Jiří (PreventaMed), Zapotocká Júlia (PAV-MED).
GASTROMART/Molnár přeskočen (už KROK 6, CDA dříve uloženo).
## Dependencies
`pymongo`, `bson` (+ stdlib). SeaweedFS volume server `192.168.1.50:8888`.
Mongo `192.168.1.76:27017`.
+126
@@ -0,0 +1,126 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name:        store_cda_seaweed_v1.0.py
# Version:     1.0
# Date:        2026-06-17
# Description: Stores signed CDAs (PDF) from the assistants' e-mails in Mongo
#              feasibility.investigators under the cda.* field and moves the physician to
#              KROK "5 - CDA podepsano". The PDFs are downloaded directly from SeaweedFS
#              (seaweed_url from attachments in emaily."vbuzalka@its.jnj.com"),
#              and the SHA256 is verified against the metadata from Mongo.
# Usage:       python store_cda_seaweed_v1.0.py            (dry-run / preview)
#              python store_cda_seaweed_v1.0.py --apply    (writes to Mongo)
# Note:        MAPPING below = explicit pairing investigator -> CDA attachment.
#              stdlib + pymongo only. SeaweedFS host 192.168.1.50:8888.
# =============================================================================
import sys
import base64
import hashlib
import urllib.request
from datetime import datetime, timezone
from pymongo import MongoClient
from bson import ObjectId
MONGO_URI = "mongodb://192.168.1.76:27017"
DBN, COL = "feasibility", "investigators"
# (investigator _id, seaweed_url, filename, sha256, size, source_msg_id, label)
MAPPING = [
("6a198b661218c31ab0f5ba57",
"http://192.168.1.50:8888/mail-attachments/1a/86/1a86e987b9d3da57c1d863b47734133f2e2d7eae3f5cfe91112c475eb86d86e9",
"CZ_CDA PI_MUDr. Filip Zavada_fully signed_16Jun2026.pdf",
"1a86e987b9d3da57c1d863b47734133f2e2d7eae3f5cfe91112c475eb86d86e9",
479026, "<CH2PR07MB7190A5538ACDC1D49F8B430780E52@CH2PR07MB7190.namprd07.prod.outlook.com>",
"Zavada Filip"),
("6a19832b5fc2213518257957",
"http://192.168.1.50:8888/mail-attachments/64/b0/64b06d48bfe3c49095e326988f14c04fd5849728b227647f6653b2e3c3095538",
"SK_CDA PI_Bruncak_FNsP BBystrica_fully signed 16Jun2026.pdf",
"64b06d48bfe3c49095e326988f14c04fd5849728b227647f6653b2e3c3095538",
498069, "<SA1PR07MB952874B8654156369CDE44448CE52@SA1PR07MB9528.namprd07.prod.outlook.com>",
"Bruncak Michal"),
("6a19832b5fc2213518257961",
"http://192.168.1.50:8888/mail-attachments/c2/72/c272ca62bd27ca10aed35cb54054d880f4f0e2f59940ed3b067b17d51a9ac041",
"CZ_CDA Institution_Asclepiades s.r.o._MUDr. Machytka_16Jun2026.pdf",
"c272ca62bd27ca10aed35cb54054d880f4f0e2f59940ed3b067b17d51a9ac041",
460977, "<PH0PR07MB97879A9C9BF9C00D38D4798A9FE52@PH0PR07MB9787.namprd07.prod.outlook.com>",
"Machytka Evzen (Asclepiades)"),
("6a19832b5fc2213518257967",
"http://192.168.1.50:8888/mail-attachments/99/37/99372c399be3b001428ef4b36d43e250dedced5955de5d1f3a2d63a9f0c1728b",
"CZ_CDA institution_PreventaMed sro_fully signed_16Jun2026.pdf",
"99372c399be3b001428ef4b36d43e250dedced5955de5d1f3a2d63a9f0c1728b",
457745, "<CH2PR07MB719008DB0B3CAFD764AE2E8280E52@CH2PR07MB7190.namprd07.prod.outlook.com>",
"Pumprla Jiri (PreventaMed)"),
("6a1c4275aa46d8b608065ce9",
"http://192.168.1.50:8888/mail-attachments/94/95/9495c742407873efd8dd9713e1dc962cb08e55e0d3690e4a79a90132ee358dee",
"SK_CDA Institution_PAV-MED s r.o_fully signed_15Jun2026.pdf",
"9495c742407873efd8dd9713e1dc962cb08e55e0d3690e4a79a90132ee358dee",
460246, "<CH2PR07MB719008DB0B3CAFD764AE2E8280E52@CH2PR07MB7190.namprd07.prod.outlook.com>",
"Zapotocka Julia (PAV-MED)"),
]
def fetch(url):
with urllib.request.urlopen(url, timeout=30) as r:
return r.read()
def main():
apply = "--apply" in sys.argv
cli = MongoClient(MONGO_URI)
col = cli[DBN][COL]
now = datetime.now(timezone.utc).isoformat()
ok = 0
for _id, url, fname, sha, size, src, label in MAPPING:
oid = ObjectId(_id)
doc = col.find_one({"_id": oid}, {"STATUS": 1, "KROK": 1, "cda.stav": 1})
        if not doc:
            print(f" !! {label}: investigator _id={_id} NOT FOUND")
            continue
        try:
            raw = fetch(url)
        except Exception as e:
            print(f" !! {label}: download failed: {e}")
            continue
        # Integrity checks: digest, byte length and PDF magic bytes must all match.
        got = hashlib.sha256(raw).hexdigest()
        sha_ok = (got == sha)
        size_ok = (len(raw) == size)
        head_ok = raw[:5] == b"%PDF-"
        print(f" [{label}]")
        print(f"   file       : {fname}")
        print(f"   downloaded : {len(raw)} B (expected {size}) {'OK' if size_ok else 'MISMATCH'}")
        print(f"   sha256     : {'OK' if sha_ok else 'MISMATCH! ' + got}")
        print(f"   PDF header : {'OK' if head_ok else 'NOT A PDF'}")
        print(f"   KROK       : {doc.get('KROK')} -> 5 - CDA podepsano")
        if not (sha_ok and size_ok and head_ok):
            print("   >> SKIPPING (check failed)")
            continue
        if not apply:
            # Dry run: count the entry as verified, but write nothing.
            ok += 1
            continue
b64 = base64.b64encode(raw).decode("ascii")
old_status = doc.get("STATUS", "") or ""
new_line = (f"17JUN2026: podepsane CDA ULOZENO do Mongo (cda.data) — {fname} "
f"(z e-mailu asistentky). KROK 5, pripraveno na SIPIQ.")
col.update_one({"_id": oid}, {"$set": {
"KROK": "5 - CDA podepsano",
            "STATUS": new_line + ("\n" + old_status if old_status else ""),
"cda.stav": "podepsano",
"cda.soubor": fname,
"cda.zdroj": "e-mail asistentky (SeaweedFS)",
"cda.data_base64": b64,
"cda.data_sha256": sha,
"cda.data_filename": fname,
"cda.data_mime": "application/pdf",
"cda.data_size": len(raw),
"cda.data_stored_at": now,
"cda.data_source_msg": src,
}})
ok += 1
        print("   >> STORED + KROK 5")
    print(f"\n{'WRITTEN' if apply else 'DRY-RUN OK'}: {ok}/{len(MAPPING)}")
    if not apply:
        print(">>> Run with --apply to write")
if __name__ == "__main__":
main()