z230
@@ -0,0 +1,70 @@
# sipiq_import_v1.0: import of SIPIQ responses into MongoDB

**Version:** 1.0 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)

## Purpose

Import SIPIQ responses (Qualtrics CSV export) into the MongoDB database `feasibility` so that it is possible to:

1. **cross-analyze** "question × question" (a flat `answers{}` map keyed by Qcode),
2. **reconstruct the complete SIPIQ** as it appears in the blank PDF, only filled in (a question dictionary
   with sections / order / sub-item labels / type / options).

## Input

Qualtrics **CSV** export (Download a data table → CSV, *Download all fields*, *Export labels*,
decimal **point**, i.e. "Use commas for decimals" left UNchecked). The CSV has 3 header rows:

- row 1 = Qcode (Q2, Q6_4, Q31#1_1 …)
- row 2 = **question text** (legend)
- row 3 = `{"ImportId":"QID…"}` = the QID code, identical to the one in the XML export (the XML↔CSV bridge)

The XML export does NOT contain the question text (only QID tags), which is why the import reads the CSV.

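
The three header rows can be read with the standard `csv` module. The sketch below is a simplified illustration rather than the script's exact code: the function name `read_qualtrics_header` and the two-column sample are invented for the example, and a real export has many more columns.

```python
import csv
import io
import json


def read_qualtrics_header(fh):
    """Return (columns, data_rows) from a Qualtrics CSV with three header rows."""
    rows = list(csv.reader(fh))
    codes, texts, import_ids = rows[0], rows[1], rows[2]
    columns = []
    for i, (code, text, raw) in enumerate(zip(codes, texts, import_ids)):
        try:
            # Row 3 holds a small JSON object such as {"ImportId":"QID25"}.
            qid = json.loads(raw).get("ImportId", "")
        except (ValueError, AttributeError):
            qid = raw  # not a JSON object: keep the raw cell
        columns.append({"i": i, "code": code, "text": text, "qid": qid})
    return columns, rows[3:]


# Invented sample: one system column and one question column.
sample = io.StringIO(
    'ResponseId,Q25\n'
    'Response ID,Are you interested?\n'
    '"{""ImportId"":""_recordId""}","{""ImportId"":""QID25""}"\n'
    'R_abc,Yes\n'
)
columns, data = read_qualtrics_header(sample)
print(columns[1]["code"], columns[1]["qid"], data[0])
```

Keeping the column index `i` next to the Qcode, text and QID lets later steps look a value up by Qcode without re-reading the header.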
## Two collections in `feasibility`

### `sipiq_questions`: questionnaire dictionary (1 doc = 1 logical question)

`{_id=base Qcode (Q63), order, qnum, section, qids[QID…], text, type, items[{key,qcode,qid,label}], options[]}`

- `type`: `single_or_text` | `yesno` | `numeric` | `matrix_yesno` | `matrix_percent` | `matrix`
- `items[]` = sub-items (matrix rows, percentage parts, contact fields) in order; `key` = sanitized Qcode (`#`/`.`→`_`)
- `options[]` = derived from observed values (yes/no and single-choice)
- Idempotent `replace_one(upsert)`. Status as of 17JUN2026: **56 questions** (27 multi-part).
- **STEM_OVERRIDE**: for matrix questions (Q31/Q63/Q64/Q69) Qualtrics truncates the text in the CSV header with "…",
  so the full wording is supplied from the blank SIPIQ PDF.

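
How column Qcodes collapse into logical questions can be sketched as follows. This is a minimal illustration under the rules stated above (base = leading `Q<number>`, `#` and `.` replaced by `_`); `group_by_base` is a hypothetical helper name, not a function of the script.

```python
import re


def sanitize_key(qcode):
    """Make a Qcode usable as a MongoDB field name: '#' and '.' become '_'."""
    return qcode.replace("#", "_").replace(".", "_")


def group_by_base(qcodes):
    """Group column Qcodes under their base question, preserving CSV order."""
    groups = {}
    for code in qcodes:
        m = re.match(r"(Q\d+)", code)
        base = m.group(1) if m else code
        groups.setdefault(base, []).append(sanitize_key(code))
    return groups


groups = group_by_base(["Q25", "Q63#1_1", "Q63#1_2", "Q6_4"])
print(groups)
```

Each key of `groups` corresponds to one `sipiq_questions` document, and each value lists the `items[].key` entries that also serve as keys in `answers{}`.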
### `sipiq_responses`: 1 doc = 1 response

- `_id` = **Qualtrics ResponseId** (`R_…`, unique, stable)
- site/PI identity promoted to the top level (`site_*`, `pi_*`, `sdl_site_id`, `fire_*`, `mailinglist_id`,
  `recipient_*`) → queryable
- `meta{}` = dates, status, progress, finished, duration, language, channel, IP, geo, survey date/time
- `answers{}` = **flat map** Qcode→value (`answers.Q37_1`, `answers.Q63_1_1`), the core of the cross-analysis
- `is_full_sipiq`, `interested` (Q25) for convenience
- **`investigator_oid`** = ObjectId reference to `feasibility.investigators` (+ `investigator_match` = how the match was made)
- delta bookkeeping: `content_sha256`, `source_file`, `first_imported_at`, `last_seen_at`,
  `last_updated_at`, `history[]`

## Delta import (overwrites ONLY changed data)

- new response → INSERT
- exists, unchanged (same `content_sha256`) → only `last_seen_at` is updated (+ `source_file`)
- exists, changed → `$set` of only the changed fields + `$push` into `history[]` `{changed_at, source_file, changes:[{key,old,new}]}`

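
The three delta cases can be decided from the stored hash alone. The sketch below is a simplified, database-free illustration: `classify` and the `stored` dictionary (standing in for the `content_sha256` values already in `sipiq_responses`) are invented for the example, and the excluded field set is shortened.

```python
import hashlib
import json

# Bookkeeping fields that must not influence the content hash.
BOOKKEEPING = {"content_sha256", "first_imported_at", "last_seen_at",
               "last_updated_at", "history", "source_file"}


def content_hash(doc):
    """SHA-256 over the document content, ignoring bookkeeping fields."""
    payload = {k: v for k, v in doc.items() if k not in BOOKKEEPING}
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def classify(doc, stored_hashes):
    """Return 'insert', 'unchanged' or 'update' for one response document."""
    old = stored_hashes.get(doc["_id"])
    if old is None:
        return "insert"
    return "unchanged" if old == content_hash(doc) else "update"


stored = {"R_1": content_hash({"_id": "R_1", "answers": {"Q25": "Yes"}})}
same = classify({"_id": "R_1", "answers": {"Q25": "Yes"}, "last_seen_at": "x"}, stored)
changed = classify({"_id": "R_1", "answers": {"Q25": "No"}}, stored)
new = classify({"_id": "R_2", "answers": {}}, stored)
print(same, changed, new)
```

Serializing with `sort_keys=True` makes the hash independent of key order, so re-exporting the same response yields the same `content_sha256`.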
## Soft-link to investigators (non-destructive)

Matching order: 1. `pi_email` == `email`/`email2` (lowercase), 2. `recipient_email`, 3. fallback on surname
(diacritics stripped) + country. The script reports each match together with its KROK. **`investigators` is NOT modified.**

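
The matching order can be sketched without a database. This is a reduced illustration, not the script's exact function: the lookup tables `by_email` and `by_name`, the placeholder ids `inv1`/`inv2` and the returned labels are assumptions made for the example.

```python
import re
import unicodedata


def norm_name(s):
    """Lowercase, strip diacritics and collapse whitespace (Novák -> novak)."""
    nfkd = unicodedata.normalize("NFKD", s or "")
    plain = "".join(c for c in nfkd if not unicodedata.combining(c))
    return re.sub(r"\s+", " ", plain.lower()).strip()


def soft_link(resp, by_email, by_name):
    """Return (investigator, how): e-mail first, then surname + country."""
    for field in ("pi_email", "recipient_email"):
        e = (resp.get(field) or "").lower().strip()
        if e and e in by_email:
            return by_email[e], field
    key = (norm_name(resp.get("pi_last_name")), resp.get("site_country"))
    cand = by_name.get(key, [])
    if len(cand) == 1:
        return cand[0], "surname"
    # Several candidates are never resolved by guessing.
    return None, ("ambiguous" if cand else "not_found")


by_email = {"a@example.org": "inv1"}
by_name = {("novak", "Czech Republic"): ["inv2"]}
print(soft_link({"pi_email": "A@Example.org"}, by_email, by_name))
print(soft_link({"pi_last_name": "Novák", "site_country": "Czech Republic"}, by_email, by_name))
```

Refusing to pick among several surname candidates keeps the link conservative: an ambiguous response stays unlinked and can be resolved by hand.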
## Usage

```
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.0.py --csv "<path.csv>" --dry-run
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.0.py --csv "<path.csv>" --apply
```

`--scope czsk` (default, CZ+SK only) | `--scope all` (all 276). Mongo 192.168.1.76:27017, no auth, pymongo.

## Status as of 17JUN2026 (live run completed)

- `sipiq_questions`: 56 · `sipiq_responses`: 15 (CZ 8 + SK 7)
- **soft-link 15/15 via e-mail, all 15 = KROK 7** (validation: the completed SIPIQs = our KROK-7 investigators)
- `investigator_oid` stored as ObjectId → ready for `$lookup`

## Queries (examples)

```js
// cross-analysis: who expects recruitment problems AND has >X eligible
// (this query filters on Q33 only and projects Q37_1)
db.sipiq_responses.find({"answers.Q33":"Yes"}, {pi_last_name:1,"answers.Q37_1":1})
// join with the investigator record
db.sipiq_responses.aggregate([{$lookup:{from:"investigators",localField:"investigator_oid",
  foreignField:"_id",as:"inv"}}])
// SIPIQ reconstruction: sort sipiq_questions by order, for each question/item take answers[key]
```

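
The reconstruction step described in the last comment can be sketched in Python. This is a minimal illustration that works on plain dictionaries shaped like the two collections. The function name `reconstruct`, the placeholder texts and the assumption that a question without `items[]` is stored in `answers{}` under its `_id` are all choices made for the example.

```python
def reconstruct(questions, answers):
    """Return (section, question text, item label, answer) tuples in questionnaire order."""
    lines = []
    for q in sorted(questions, key=lambda q: q["order"]):
        if q.get("items"):
            # Multi-part question: one output line per sub-item.
            for item in q["items"]:
                lines.append((q["section"], q["text"], item.get("label"),
                              answers.get(item["key"])))
        else:
            # Simple question: assumed to be keyed by its base Qcode.
            lines.append((q["section"], q["text"], None, answers.get(q["_id"])))
    return lines


# Placeholder documents; texts and labels are not the real SIPIQ wording.
questions = [
    {"_id": "Q63", "order": 1, "section": "Site Experience and Staffing",
     "text": "(Q63 text)",
     "items": [{"key": "Q63_1_1", "label": "(row 1)"},
               {"key": "Q63_1_2", "label": "(row 2)"}]},
    {"_id": "Q25", "order": 0, "section": "Interest", "text": "(Q25 text)", "items": []},
]
answers = {"Q25": "Yes", "Q63_1_1": "No"}
for line in reconstruct(questions, answers):
    print(line)
```

Unanswered items come back as `None`, so a filled-in questionnaire can still show every row of the blank form.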
@@ -0,0 +1,534 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
sipiq_import_v1.0.py
====================
Version: 1.0
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)

Description
-----------
Import of SIPIQ responses (Qualtrics CSV export, study 77242113UCO3002 / ICONIC DAWN)
into the MongoDB database `feasibility`. The goals are:
  (a) to enable "question × question" cross-analysis (flat answers keyed by Qcode),
  (b) to enable reconstruction of the COMPLETE SIPIQ as the investigator sees it in the PDF,
      only with the answers filled in (question dictionary with section/order/labels).

Two collections in the `feasibility` database:
  * sipiq_questions - questionnaire dictionary (1 doc = 1 logical question; section, order,
                      text, items[], type, options). Idempotent (upsert by _id).
  * sipiq_responses - 1 doc = 1 response (_id = Qualtrics ResponseId). Site/PI identity
                      at the top level, flat answers{}, meta{}, soft-link investigator_oid,
                      delta bookkeeping (content_sha256, history[], timestamps).

DELTA import (overwrites ONLY changed data):
  - new response              -> insert
  - exists, unchanged         -> only last_seen_at is updated (+ source_file)
  - exists, something changed -> $set of only the changed fields + push into history[] {key,old,new}

Soft-link to feasibility.investigators:
  - primarily pi_email == email / email2 (lowercase)
  - fallback: surname (diacritics stripped, lowercase) + country (CZ/SK)
  - non-destructive: does NOT modify the investigators collection, only stores investigator_oid in the response.

Scope: default CZ + SK (--scope czsk). --scope all = all 276.

Usage:
  python sipiq_import_v1.0.py --csv "<path.csv>" --dry-run
  python sipiq_import_v1.0.py --csv "<path.csv>" --apply

Dependencies: pymongo (.venv). Mongo 192.168.1.76:27017, no auth.
"""
import argparse
import csv
import hashlib
import json
import re
import sys
import unicodedata
from datetime import datetime, timezone

try:
    from pymongo import MongoClient
except ImportError:
    print("ERROR: pymongo is not installed in the current Python.", file=sys.stderr)
    raise

MONGO_URI = "mongodb://192.168.1.76:27017"
DB_NAME = "feasibility"
COL_Q = "sipiq_questions"
COL_R = "sipiq_responses"

# Qualtrics system meta fields (these do NOT go into answers)
META_COLS = {
    "StartDate", "EndDate", "Status", "IPAddress", "Progress", "Duration (in seconds)",
    "Finished", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName",
    "RecipientEmail", "ExternalReference", "LocationLatitude", "LocationLongitude",
    "DistributionChannel", "UserLanguage",
}

# Embedded SDL fields promoted to the top level of the document (queryable identity)
PROMOTE = [
    "site_name", "site_address", "site_city", "site_state", "site_postcode", "site_country",
    "pi_first_name", "pi_last_name", "pi_phone", "pi_email",
    "sdl_site_id", "fire_site_id", "fire_investigator_id", "mailinglist_id",
    "survey_generated_by", "Date", "Time",
]

# Sections per the verified catalogue (mapping of base Q-number -> section in the PDF)
SECTION_BY_QNUM = {}


def _sec(rng, name):
    for n in rng:
        SECTION_BY_QNUM[n] = name


_sec([2], "J&J Internal Assessment")
_sec([6, 7, 8, 9, 10, 11, 12, 13], "Contact Information")
_sec(range(14, 22), "Confidentiality Statement")
_sec([25, 26, 27], "Interest")
_sec([29, 30, 31, 32, 33, 34], "Protocol Requirements")
_sec([36, 37, 38], "Enrollment")
_sec([40, 41, 42, 43], "Patient Demographics Overview")
_sec([45, 46, 47, 48, 49], "Site Overview")
_sec([51], "Operational Considerations")
_sec([53, 54], "Comments")
_sec([57, 58, 59, 60, 61], "Patient Population")
_sec([63, 64, 65, 66, 67], "Site Experience and Staffing")
_sec([69], "Equipment and Facility Requirements")
_sec([71, 72, 73, 74, 75], "Institutional Review Board, Ethics Committee, and Contracts")

# Full wording of questions that Qualtrics truncates with "..." in the CSV header (matrix questions).
# Source: blank SIPIQ PDF (ICONIC ... _SipIQ_V1_13MAY2026.pdf).
STEM_OVERRIDE = {
    "Q31": "At your site, at what line(s) of treatment do you most commonly prescribe "
           "vedolizumab for patients with moderately to severely active ulcerative colitis?",
    "Q63": "Do you or your site staff have experience in performing the following types of "
           "study assessments/procedures?",
    "Q64": "The following personnel are required to run the study. "
           "Will your site have the following available?",
    "Q69": "The following equipment and facilities are required to run the studies. "
           "Are these available at your site?",
}


def now_iso():
    return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")


def strip_accents(s):
    if not s:
        return ""
    nfkd = unicodedata.normalize("NFKD", s)
    return "".join(c for c in nfkd if not unicodedata.combining(c))


def norm_name(s):
    return re.sub(r"\s+", " ", strip_accents(s or "").lower()).strip()


def sanitize_key(qcode):
    """Qcode -> key in answers{} (MongoDB-safe): '#' and '.' -> '_'."""
    return qcode.replace("#", "_").replace(".", "_")


def qnum(qcode):
    """Question number from a Qcode (Q63#1_2 -> 63, Q40_6_TEXT -> 40)."""
    m = re.match(r"Q(\d+)", qcode)
    return int(m.group(1)) if m else None


def qbase(qcode):
    """Logical base of a question (Q63#1_2 -> Q63, Q40_6 -> Q40, Q25 -> Q25)."""
    m = re.match(r"(Q\d+)", qcode)
    return m.group(1) if m else qcode


def import_id(h3_cell):
    try:
        return json.loads(h3_cell).get("ImportId", "")
    except Exception:
        return h3_cell


def split_text(text):
    """Return (stem, item_label). Stem = question text, item_label = label of the sub-item."""
    parts = [p.strip() for p in re.split(r"\s+-\s+", text)]
    stem = parts[0]
    if len(parts) == 1:
        return stem, None
    # everything after the stem = label of the row/part; clean up Qualtrics artifacts
    label_parts = parts[1:]
    # drop "Selected Choice" (artifact of single-choice questions with Other)
    label_parts = [p for p in label_parts if p.lower() != "selected choice"]
    # drop internal statement codes such as "Q63#1"
    label_parts = [p for p in label_parts if not re.fullmatch(r"Q\d+#\d+", p)]
    label = " - ".join(label_parts) if label_parts else None
    return stem, label


def detect_type(qcode, observed):
    """Heuristic question type from the Qcode and the observed values."""
    has_hash = "#" in qcode
    vals = [v for v in observed if v]
    yesno = vals and all(v in ("Yes", "No") for v in vals)
    numeric = vals and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals)
    if has_hash and yesno:
        return "matrix_yesno"
    if has_hash and numeric:
        return "matrix_percent"
    if has_hash:
        return "matrix"
    if numeric:
        return "numeric"
    if yesno:
        return "yesno"
    return "single_or_text"


# ---------------------------------------------------------------------------
def load_csv(path):
    # utf-8-sig strips the BOM that Qualtrics exports may start with
    with open(path, encoding="utf-8-sig", newline="") as fh:
        rows = list(csv.reader(fh))
    # three header rows: Qcode, question text, ImportId JSON
    h1, h2, h3 = rows[0], rows[1], rows[2]
    data = rows[3:]
    cols = []
    for i, (code, text, imp) in enumerate(zip(h1, h2, h3)):
        cols.append({"i": i, "code": code, "text": text, "qid": import_id(imp)})
    return cols, data


def col_getter(cols, data):
    idx = {c["code"]: c["i"] for c in cols}

    def get(row, code):
        i = idx.get(code)
        return (row[i].strip() if i is not None and i < len(row) else "")

    return get, idx


def is_question_col(code):
    return bool(re.match(r"Q\d", code))


# ---------------------------------------------------------------------------
def build_questions(cols, data):
    """Question dictionary -> list of documents (1 doc = 1 logical question)."""
    # observed values per Qcode (used for type + options)
    qcols = [c for c in cols if is_question_col(c["code"])]
    observed = {c["code"]: set() for c in qcols}
    for row in data:
        for c in qcols:
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                observed[c["code"]].add(v)

    groups = {}  # base -> dict
    order_seen = []
    for c in qcols:
        base = qbase(c["code"])
        if base not in groups:
            groups[base] = {
                "_id": base,
                "order": c["i"],
                "qnum": qnum(c["code"]),
                "section": SECTION_BY_QNUM.get(qnum(c["code"]), "Other"),
                "qids": [],
                "text": split_text(c["text"])[0],
                "items": [],
                "_obs": set(),
                "_types": [],
            }
            order_seen.append(base)
        g = groups[base]
        base_qid = re.match(r"(QID\d+)", c["qid"] or "")
        if base_qid and base_qid.group(1) not in g["qids"]:
            g["qids"].append(base_qid.group(1))
        stem, label = split_text(c["text"])
        key = sanitize_key(c["code"])
        item = {"key": key, "qcode": c["code"], "qid": c["qid"]}
        if label:
            item["label"] = label
        g["items"].append(item)
        g["_obs"] |= observed[c["code"]]
        g["_types"].append(detect_type(c["code"], observed[c["code"]]))

    out = []
    for n, base in enumerate(order_seen):
        g = groups[base]
        obs = sorted(g.pop("_obs"))
        types = g.pop("_types")
        # group type: the most frequent item type
        gtype = max(set(types), key=types.count) if types else "single_or_text"
        g["type"] = gtype
        # options only for categorical questions (yesno/single)
        if gtype in ("yesno", "matrix_yesno"):
            g["options"] = ["Yes", "No"]
        elif gtype == "single_or_text" and obs and len(obs) <= 12:
            g["options"] = obs
        else:
            g["options"] = []
        if base in STEM_OVERRIDE:
            g["text"] = STEM_OVERRIDE[base]
        g["order"] = n  # renumber 0..N by order in the CSV
        # a single item without a label is a plain question, so drop items
        if len(g["items"]) == 1 and "label" not in g["items"][0]:
            g["items"] = []
        out.append(g)
    return out


# ---------------------------------------------------------------------------
def build_response(cols, get, row, source_file):
    rid = get(row, "ResponseId")
    answers = {}
    for c in cols:
        if is_question_col(c["code"]):
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                answers[sanitize_key(c["code"])] = v

    def g(*names):
        # first non-empty value among the given column names
        for nm in names:
            v = get(row, nm)
            if v:
                return v
        return None

    meta = {
        "start_date": get(row, "StartDate") or None,
        "end_date": get(row, "EndDate") or None,
        "recorded_date": get(row, "RecordedDate") or None,
        "status": get(row, "Status") or None,
        "progress": int(get(row, "Progress")) if get(row, "Progress").isdigit() else get(row, "Progress") or None,
        "finished": get(row, "Finished") in ("True", "1", "TRUE"),
        "duration_sec": int(get(row, "Duration (in seconds)")) if get(row, "Duration (in seconds)").isdigit() else None,
        "user_language": get(row, "UserLanguage") or None,
        "distribution_channel": get(row, "DistributionChannel") or None,
        "ip_address": get(row, "IPAddress") or None,
        "location_lat": get(row, "LocationLatitude") or None,
        "location_lng": get(row, "LocationLongitude") or None,
        "survey_date": get(row, "Date") or None,
        "survey_time": get(row, "Time") or None,
    }

    doc = {
        "_id": rid,
        "study": "77242113UCO3002",
        "site_country": get(row, "site_country") or None,
        "site_name": get(row, "site_name") or None,
        "site_city": get(row, "site_city") or None,
        "site_state": get(row, "site_state") or None,
        "site_postcode": get(row, "site_postcode") or None,
        "site_address": get(row, "site_address") or None,
        "pi_first_name": get(row, "pi_first_name") or None,
        "pi_last_name": get(row, "pi_last_name") or None,
        "pi_email": (get(row, "pi_email") or "").lower() or None,
        "pi_phone": get(row, "pi_phone") or None,
        "sdl_site_id": get(row, "sdl_site_id") or None,
        "fire_site_id": get(row, "fire_site_id") or None,
        "fire_investigator_id": get(row, "fire_investigator_id") or None,
        "mailinglist_id": get(row, "mailinglist_id") or None,
        "survey_generated_by": get(row, "survey_generated_by") or None,
        "recipient_email": (get(row, "RecipientEmail") or "").lower() or None,
        "recipient_last_name": get(row, "RecipientLastName") or None,
        "recipient_first_name": get(row, "RecipientFirstName") or None,
        "meta": meta,
        "is_full_sipiq": any(k.startswith(("Q57", "Q58", "Q59", "Q63", "Q66", "Q71")) for k in answers),
        "interested": answers.get("Q25"),
        "answers": answers,
        "investigator_oid": None,
        "investigator_match": None,
        "source_file": source_file,
    }
    return doc


def content_hash(doc):
    # hash only the content: bookkeeping and soft-link fields are excluded
    payload = {k: doc[k] for k in doc if k not in
               ("content_sha256", "first_imported_at", "last_seen_at", "last_updated_at", "history",
                "investigator_oid", "investigator_match", "source_file")}
    blob = json.dumps(payload, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


# ---------------------------------------------------------------------------
def load_investigators(db):
    inv = list(db.investigators.find(
        {"zeme": {"$in": ["Czech Republic", "Slovakia"]}},
        {"prijmeni": 1, "jmeno": 1, "email": 1, "email2": 1, "zeme": 1, "KROK": 1, "pracoviste": 1},
    ))
    by_email = {}
    by_name = {}
    for d in inv:
        for ef in ("email", "email2"):
            e = (d.get(ef) or "").lower().strip()
            if e:
                by_email.setdefault(e, d)
        nm = norm_name(d.get("prijmeni"))
        if nm:
            by_name.setdefault((nm, d.get("zeme")), []).append(d)
    return inv, by_email, by_name


def soft_link(doc, by_email, by_name):
    # 1) PI e-mail
    e = (doc.get("pi_email") or "").lower().strip()
    if e and e in by_email:
        d = by_email[e]
        return d["_id"], f"email:{e}", d
    # 2) recipient e-mail
    e2 = (doc.get("recipient_email") or "").lower().strip()
    if e2 and e2 in by_email:
        d = by_email[e2]
        return d["_id"], f"recipient_email:{e2}", d
    # 3) fallback: surname (diacritics stripped) + country, only if unambiguous
    nm = norm_name(doc.get("pi_last_name"))
    cand = by_name.get((nm, doc.get("site_country")), [])
    if len(cand) == 1:
        return cand[0]["_id"], f"prijmeni:{nm}", cand[0]
    if len(cand) > 1:
        return None, f"prijmeni_ambiguous:{nm}({len(cand)})", None
    return None, "NENALEZENO", None


# ---------------------------------------------------------------------------
def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--csv", required=True)
    ap.add_argument("--scope", choices=["czsk", "all"], default="czsk")
    ap.add_argument("--apply", action="store_true", help="live write (otherwise dry-run)")
    ap.add_argument("--dry-run", action="store_true")
    args = ap.parse_args()
    dry = not args.apply
    source_file = args.csv.replace("\\", "/").split("/")[-1]

    cols, data = load_csv(args.csv)
    get, idx = col_getter(cols, data)

    # scope filter
    if args.scope == "czsk":
        data = [r for r in data if get(r, "site_country") in ("Czech Republic", "Slovakia")]
    print(f"Source: {source_file} | scope={args.scope} | responses to import: {len(data)}")

    # --- question dictionary (built from the FULL CSV, not just the scope) ---
    cols_all, data_all = load_csv(args.csv)
    questions = build_questions(cols_all, data_all)
    print(f"Question dictionary: {len(questions)} logical questions "
          f"({sum(1 for q in questions if q['items'])} of them multi-part).")

    # --- Mongo ---
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
    db = client[DB_NAME]
    client.admin.command("ping")
    inv, by_email, by_name = load_investigators(db)
    print(f"CZ+SK investigators in DB: {len(inv)}")

    # --- response documents + soft-link ---
    docs = []
    link_rows = []
    for r in data:
        doc = build_response(cols, get, r, source_file)
        oid, how, matched = soft_link(doc, by_email, by_name)
        doc["investigator_oid"] = oid
        doc["investigator_match"] = how
        doc["content_sha256"] = content_hash(doc)
        docs.append(doc)
        link_rows.append((doc, how, matched))

    # --- delta against the DB ---
    existing = {d["_id"]: d for d in db[COL_R].find({}, {"content_sha256": 1})}
    to_insert = [d for d in docs if d["_id"] not in existing]
    to_update, unchanged = [], []
    for d in docs:
        if d["_id"] in existing:
            if existing[d["_id"]].get("content_sha256") != d["content_sha256"]:
                to_update.append(d)
            else:
                unchanged.append(d)

    # ===================== REPORT =====================
    print("\n=== SOFT-LINK to investigators ===")
    matched_k7 = matched_other = unmatched = 0
    for doc, how, m in link_rows:
        krok = (m or {}).get("KROK", "")
        tag = "✓" if m else "✗"
        if m and str(krok).startswith("7"):
            matched_k7 += 1
        elif m:
            matched_other += 1
        else:
            unmatched += 1
        # site_country may be stored as None, so fall back to '?' before slicing
        print(f"  {tag} {(doc.get('site_country') or '?')[:2]} {str(doc.get('pi_last_name'))[:18]:18} "
              f"{str(doc.get('pi_email'))[:32]:32} -> {how[:40]:40} {('KROK '+str(krok)) if m else ''}")
    print(f"  Summary: matched KROK7={matched_k7}, other KROK={matched_other}, unmatched={unmatched}")

    print("\n=== DELTA ===")
    print(f"  INSERT (new):     {len(to_insert)}")
    print(f"  UPDATE (changed): {len(to_update)}")
    print(f"  unchanged:        {len(unchanged)}")

    # sample of 1 document
    if docs:
        s = dict(docs[0])
        s["answers"] = {k: s["answers"][k] for k in list(s["answers"])[:6]}
        s["answers"]["…"] = f"(+{max(0, len(docs[0]['answers']) - 6)} more)"
        print("\n=== SAMPLE response document (truncated) ===")
        print(json.dumps(s, ensure_ascii=False, indent=2, default=str)[:1800])

    if dry:
        print("\n[DRY-RUN] Nothing was written. For a live run add --apply")
        client.close()
        return

    # ===================== WRITE =====================
    # 1) question dictionary (idempotent upsert)
    nq = 0
    for q in questions:
        db[COL_Q].replace_one({"_id": q["_id"]}, q, upsert=True)
        nq += 1
    print(f"\n[APPLY] sipiq_questions: upserted {nq}")

    # 2) responses (delta)
    ts = now_iso()
    ni = nu = ns = 0
    for d in docs:
        cur = db[COL_R].find_one({"_id": d["_id"]})
        if cur is None:
            d["first_imported_at"] = ts
            d["last_seen_at"] = ts
            d["last_updated_at"] = ts
            d["history"] = []
            db[COL_R].insert_one(d)
            ni += 1
        elif cur.get("content_sha256") != d["content_sha256"]:
            changes = diff_docs(cur, d)
            # Note: $set writes every field of the new document (except _id);
            # history[] records only the fields that diff_docs found changed.
            db[COL_R].update_one({"_id": d["_id"]}, {
                "$set": {**{k: d[k] for k in d if k not in ("_id",)},
                         "last_seen_at": ts, "last_updated_at": ts},
                "$push": {"history": {"changed_at": ts, "source_file": source_file, "changes": changes}},
            })
            nu += 1
        else:
            db[COL_R].update_one({"_id": d["_id"]},
                                 {"$set": {"last_seen_at": ts, "source_file": source_file}})
            ns += 1
    print(f"[APPLY] sipiq_responses: insert={ni}, update={nu}, unchanged={ns}")
    client.close()


def diff_docs(old, new):
    """Field-level diff for history (only answers + promoted fields + meta)."""
    changes = []

    def walk(prefix, o, n):
        keys = set((o or {}).keys()) | set((n or {}).keys())
        for k in sorted(keys):
            ov, nv = (o or {}).get(k), (n or {}).get(k)
            if isinstance(ov, dict) or isinstance(nv, dict):
                walk(f"{prefix}{k}.", ov or {}, nv or {})
            elif ov != nv:
                changes.append({"key": f"{prefix}{k}", "old": ov, "new": nv})

    for field in ("answers", "meta"):
        walk(f"{field}.", old.get(field, {}), new.get(field, {}))
    for k in ("site_name", "pi_email", "pi_last_name", "interested", "is_full_sipiq"):
        if old.get(k) != new.get(k):
            changes.append({"key": k, "old": old.get(k), "new": new.get(k)})
    return changes


if __name__ == "__main__":
    main()
@@ -0,0 +1,47 @@
# sipiq_import_v1.1: import of SIPIQ responses into MongoDB (folder workflow)

**Version:** 1.1 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)

## Changes from v1.0
- **FOLDER WORKFLOW** (`--folder`): collects every `*.csv` in the folder, imports it (delta)
  and, after successful processing, **moves the file into the `Zpracováno` subfolder**.
  Default folder = `U:\PythonProject\Janssen\Feasibility\77242113UCO2001\ImportSIPIQcompled`.
  Incoming/Processed pattern (as with IWRS / Panorama). The old v1.0 → `Feasibility\TRASH`.

## Purpose and collections
(same as v1.0) Import of the Qualtrics CSV export into the `feasibility` database:
- `sipiq_questions`: questionnaire dictionary (reconstruction of the SIPIQ as in the PDF).
- `sipiq_responses`: 1 doc = 1 response (`_id`=ResponseId), flat `answers{}`,
  soft-link `investigator_oid`, delta + `history[]`.

Source = CSV (row 1 Qcode, row 2 question text, row 3 ImportId=QID). The XML does not contain the question text.

## Delta import (overwrites ONLY changed data)
new→INSERT; unchanged (same `content_sha256`)→only `last_seen_at`;
changed→`$set` of only the changed fields + `$push` into `history[]`.

## Soft-link to investigators (non-destructive)
pi_email → email/email2 (lowercase), then recipient_email, fallback on surname (diacritics stripped)+country.

## Usage
```
# folder mode (default folder): processes everything and moves it to Zpracováno
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --dry-run
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --apply
# a different folder
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --folder "<path>" --apply
# a single file (NOT moved)
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.1.py --csv "<path.csv>" --apply
```
`--scope czsk` (default) / `all`. Default = dry-run, live run = `--apply`.
The move to `Zpracováno` happens ONLY with `--apply` and ONLY in folder mode (not with `--csv`).
Name collisions in `Zpracováno` → suffix `_N`.

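
The move with collision handling can be sketched as below. This is a minimal illustration, not the script's code: `move_to_processed` is a hypothetical name, and placing the `_N` suffix before the file extension is an assumption about what "suffix `_N`" means.

```python
import os
import shutil
import tempfile


def move_to_processed(path, processed_subdir="Zpracováno"):
    """Move `path` into a subfolder next to it, adding _1, _2, ... on name collisions."""
    target_dir = os.path.join(os.path.dirname(path), processed_subdir)
    os.makedirs(target_dir, exist_ok=True)
    stem, ext = os.path.splitext(os.path.basename(path))
    target = os.path.join(target_dir, stem + ext)
    n = 1
    while os.path.exists(target):
        # Assumed naming: the suffix goes before the extension (export_1.csv).
        target = os.path.join(target_dir, f"{stem}_{n}{ext}")
        n += 1
    shutil.move(path, target)
    return target


# Demonstration in a throwaway folder: the same file name arrives twice.
incoming = tempfile.mkdtemp()
moved = []
for _ in range(2):
    src = os.path.join(incoming, "export.csv")
    with open(src, "w", encoding="utf-8") as fh:
        fh.write("ResponseId\n")
    moved.append(os.path.basename(move_to_processed(src)))
print(moved)
shutil.rmtree(incoming)
```

Because an earlier file is never overwritten, every processed export stays available for audit even when Qualtrics reuses a file name.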
## Workflow (agreed 17JUN2026)
The user drops complete SIPIQ reports (Qualtrics CSV) into `ImportSIPIQcompled\`.
After processing, the script moves the file to `ImportSIPIQcompled\Zpracováno\`. The delta logic ensures
that a repeated or extended export only adds new/changed responses (the rest stays unchanged).

## Status as of 17JUN2026
Folder + `Zpracováno` are ready. The initial import (15 CZ+SK from the 06.06 export) was still done with v1.0:
`sipiq_questions`:56, `sipiq_responses`:15, soft-link 15/15 via e-mail = KROK 7.
@@ -0,0 +1,480 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
sipiq_import_v1.1.py
====================
Version: 1.1
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)

Changes from v1.0
-----------------
- FOLDER WORKFLOW: the --folder mode collects every *.csv in the folder, imports it (delta)
  and, after successful processing, moves the file into the `Zpracováno` subfolder. Default folder =
  U:\\PythonProject\\Janssen\\Feasibility\\77242113UCO2001\\ImportSIPIQcompled.
  (Incoming/Processed pattern as with IWRS / Panorama.) The old v1.0 is kept in TRASH.

Description
-----------
Import of SIPIQ responses (Qualtrics CSV export, study 77242113UCO3002 / ICONIC DAWN)
into the MongoDB database `feasibility`. Two collections:
  * sipiq_questions - questionnaire dictionary (1 doc = 1 logical question).
  * sipiq_responses - 1 doc = 1 response (_id = Qualtrics ResponseId), flat answers{},
                      soft-link investigator_oid, delta bookkeeping + history[].

DELTA import (overwrites ONLY changed data): new->insert; unchanged->only last_seen_at;
changed->$set of only the changed fields + push into history[].

Usage
-----
  # folder mode (default folder): processes everything and moves it to Zpracováno
  python sipiq_import_v1.1.py --dry-run
  python sipiq_import_v1.1.py --apply
  # a specific folder
  python sipiq_import_v1.1.py --folder "<path>" --apply
  # a single file (NOT moved)
  python sipiq_import_v1.1.py --csv "<path.csv>" --apply

Dependencies: pymongo (.venv). Mongo 192.168.1.76:27017, no auth.
"""
import argparse
import csv
import glob
import hashlib
import json
import os
import re
import shutil
import sys
import unicodedata
from datetime import datetime, timezone

try:
    from pymongo import MongoClient
except ImportError:
    print("ERROR: pymongo is not installed in the current Python.", file=sys.stderr)
    raise

|
||||
MONGO_URI = "mongodb://192.168.1.76:27017"
|
||||
DB_NAME = "feasibility"
|
||||
COL_Q = "sipiq_questions"
|
||||
COL_R = "sipiq_responses"
|
||||
DEFAULT_FOLDER = r"U:\PythonProject\Janssen\Feasibility\77242113UCO2001\ImportSIPIQcompled"
|
||||
PROCESSED_SUBDIR = "Zpracováno"
|
||||
|
||||
META_COLS = {
|
||||
"StartDate", "EndDate", "Status", "IPAddress", "Progress", "Duration (in seconds)",
|
||||
"Finished", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName",
|
||||
"RecipientEmail", "ExternalReference", "LocationLatitude", "LocationLongitude",
|
||||
"DistributionChannel", "UserLanguage",
|
||||
}
|
||||
|
||||
PROMOTE = [
|
||||
"site_name", "site_address", "site_city", "site_state", "site_postcode", "site_country",
|
||||
"pi_first_name", "pi_last_name", "pi_phone", "pi_email",
|
||||
"sdl_site_id", "fire_site_id", "fire_investigator_id", "mailinglist_id",
|
||||
"survey_generated_by", "Date", "Time",
|
||||
]
|
||||
|
||||
SECTION_BY_QNUM = {}


def _sec(rng, name):
    # Map every question number in `rng` to the section title `name`.
    for n in rng:
        SECTION_BY_QNUM[n] = name


_sec([2], "J&J Internal Assessment")
_sec([6, 7, 8, 9, 10, 11, 12, 13], "Contact Information")
_sec(range(14, 22), "Confidentiality Statement")
_sec([25, 26, 27], "Interest")
_sec([29, 30, 31, 32, 33, 34], "Protocol Requirements")
_sec([36, 37, 38], "Enrollment")
_sec([40, 41, 42, 43], "Patient Demographics Overview")
_sec([45, 46, 47, 48, 49], "Site Overview")
_sec([51], "Operational Considerations")
_sec([53, 54], "Comments")
_sec([57, 58, 59, 60, 61], "Patient Population")
_sec([63, 64, 65, 66, 67], "Site Experience and Staffing")
_sec([69], "Equipment and Facility Requirements")
_sec([71, 72, 73, 74, 75], "Institutional Review Board, Ethics Committee, and Contracts")

STEM_OVERRIDE = {
    "Q31": "At your site, at what line(s) of treatment do you most commonly prescribe "
           "vedolizumab for patients with moderately to severely active ulcerative colitis?",
    "Q63": "Do you or your site staff have experience in performing the following types of "
           "study assessments/procedures?",
    "Q64": "The following personnel are required to run the study. "
           "Will your site have the following available?",
    "Q69": "The following equipment and facilities are required to run the studies. "
           "Are these available at your site?",
}


def now_iso():
    # Local time with UTC offset, second precision.
    return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")


def strip_accents(s):
    if not s:
        return ""
    return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))


def norm_name(s):
    # Accent-free, lower-cased, whitespace-collapsed form used for surname matching.
    return re.sub(r"\s+", " ", strip_accents(s or "").lower()).strip()


def sanitize_key(qcode):
    # MongoDB field names must not contain "." and "#" is awkward in queries.
    return qcode.replace("#", "_").replace(".", "_")


def qnum(qcode):
    m = re.match(r"Q(\d+)", qcode)
    return int(m.group(1)) if m else None


def qbase(qcode):
    m = re.match(r"(Q\d+)", qcode)
    return m.group(1) if m else qcode


def import_id(h3_cell):
    # Header row 3 holds JSON such as {"ImportId": "QID..."}; fall back to the raw cell.
    try:
        return json.loads(h3_cell).get("ImportId", "")
    except Exception:
        return h3_cell


def split_text(text):
    # Qualtrics joins stem and sub-part labels with " - "; return (stem, label or None).
    parts = [p.strip() for p in re.split(r"\s+-\s+", text)]
    stem = parts[0]
    if len(parts) == 1:
        return stem, None
    label_parts = [p for p in parts[1:] if p.lower() != "selected choice"]
    label_parts = [p for p in label_parts if not re.fullmatch(r"Q\d+#\d+", p)]
    return stem, (" - ".join(label_parts) if label_parts else None)


def detect_type(qcode, observed):
    has_hash = "#" in qcode
    vals = [v for v in observed if v]
    yesno = bool(vals) and all(v in ("Yes", "No") for v in vals)
    numeric = bool(vals) and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals)
    if has_hash and yesno:
        return "matrix_yesno"
    if has_hash and numeric:
        return "matrix_percent"
    if has_hash:
        return "matrix"
    if numeric:
        return "numeric"
    if yesno:
        return "yesno"
    return "single_or_text"


def load_csv(path):
    with open(path, encoding="utf-8-sig", newline="") as fh:
        rows = list(csv.reader(fh))
    # A Qualtrics export carries three header rows: Qcode, question text, ImportId JSON.
    if len(rows) < 3:
        raise ValueError(f"{path}: expected 3 Qualtrics header rows, found {len(rows)}")
    h1, h2, h3 = rows[0], rows[1], rows[2]
    data = rows[3:]
    cols = [{"i": i, "code": c, "text": t, "qid": import_id(j)}
            for i, (c, t, j) in enumerate(zip(h1, h2, h3))]
    return cols, data


def col_getter(cols, data):
    idx = {c["code"]: c["i"] for c in cols}

    def get(row, code):
        i = idx.get(code)
        return (row[i].strip() if i is not None and i < len(row) else "")

    return get, idx


def is_question_col(code):
    return bool(re.match(r"Q\d", code))


def build_questions(cols, data):
    qcols = [c for c in cols if is_question_col(c["code"])]
    # Collect the distinct non-empty values observed in each question column.
    observed = {c["code"]: set() for c in qcols}
    for row in data:
        for c in qcols:
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                observed[c["code"]].add(v)
    # Group columns by base Qcode (Q63_1, Q63_2 -> Q63), preserving CSV order.
    groups, order_seen = {}, []
    for c in qcols:
        base = qbase(c["code"])
        if base not in groups:
            groups[base] = {"_id": base, "order": c["i"], "qnum": qnum(c["code"]),
                            "section": SECTION_BY_QNUM.get(qnum(c["code"]), "Other"),
                            "qids": [], "text": split_text(c["text"])[0],
                            "items": [], "_obs": set(), "_types": []}
            order_seen.append(base)
        g = groups[base]
        bq = re.match(r"(QID\d+)", c["qid"] or "")
        if bq and bq.group(1) not in g["qids"]:
            g["qids"].append(bq.group(1))
        _, label = split_text(c["text"])
        item = {"key": sanitize_key(c["code"]), "qcode": c["code"], "qid": c["qid"]}
        if label:
            item["label"] = label
        g["items"].append(item)
        g["_obs"] |= observed[c["code"]]
        g["_types"].append(detect_type(c["code"], observed[c["code"]]))
    out = []
    for n, base in enumerate(order_seen):
        g = groups[base]
        obs = sorted(g.pop("_obs"))
        types = g.pop("_types")
        # Majority vote over the per-column types; sorting the candidates first
        # makes ties deterministic, so repeated imports produce the same document.
        gtype = max(sorted(set(types)), key=types.count) if types else "single_or_text"
        g["type"] = gtype
        if gtype in ("yesno", "matrix_yesno"):
            g["options"] = ["Yes", "No"]
        elif gtype == "single_or_text" and obs and len(obs) <= 12:
            g["options"] = obs
        else:
            g["options"] = []
        if base in STEM_OVERRIDE:
            g["text"] = STEM_OVERRIDE[base]
        g["order"] = n
        # A single unlabeled column is the question itself, not a sub-part.
        if len(g["items"]) == 1 and "label" not in g["items"][0]:
            g["items"] = []
        out.append(g)
    return out


def build_response(cols, get, row, source_file):
    rid = get(row, "ResponseId")
    # Flat answers dict keyed by sanitized Qcode; empty cells are skipped.
    answers = {}
    for c in cols:
        if is_question_col(c["code"]):
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                answers[sanitize_key(c["code"])] = v
    progress = get(row, "Progress")
    duration = get(row, "Duration (in seconds)")
    meta = {
        "start_date": get(row, "StartDate") or None,
        "end_date": get(row, "EndDate") or None,
        "recorded_date": get(row, "RecordedDate") or None,
        "status": get(row, "Status") or None,
        "progress": int(progress) if progress.isdigit() else (progress or None),
        "finished": get(row, "Finished") in ("True", "1", "TRUE"),
        "duration_sec": int(duration) if duration.isdigit() else None,
        "user_language": get(row, "UserLanguage") or None,
        "distribution_channel": get(row, "DistributionChannel") or None,
        "ip_address": get(row, "IPAddress") or None,
        "location_lat": get(row, "LocationLatitude") or None,
        "location_lng": get(row, "LocationLongitude") or None,
        "survey_date": get(row, "Date") or None,
        "survey_time": get(row, "Time") or None,
    }
    doc = {
        "_id": rid, "study": "77242113UCO3002",
        "site_country": get(row, "site_country") or None,
        "site_name": get(row, "site_name") or None,
        "site_city": get(row, "site_city") or None,
        "site_state": get(row, "site_state") or None,
        "site_postcode": get(row, "site_postcode") or None,
        "site_address": get(row, "site_address") or None,
        "pi_first_name": get(row, "pi_first_name") or None,
        "pi_last_name": get(row, "pi_last_name") or None,
        "pi_email": (get(row, "pi_email") or "").lower() or None,
        "pi_phone": get(row, "pi_phone") or None,
        "sdl_site_id": get(row, "sdl_site_id") or None,
        "fire_site_id": get(row, "fire_site_id") or None,
        "fire_investigator_id": get(row, "fire_investigator_id") or None,
        "mailinglist_id": get(row, "mailinglist_id") or None,
        "survey_generated_by": get(row, "survey_generated_by") or None,
        "recipient_email": (get(row, "RecipientEmail") or "").lower() or None,
        "recipient_last_name": get(row, "RecipientLastName") or None,
        "recipient_first_name": get(row, "RecipientFirstName") or None,
        "meta": meta,
        "is_full_sipiq": any(k.startswith(("Q57", "Q58", "Q59", "Q63", "Q66", "Q71")) for k in answers),
        "interested": answers.get("Q25"),
        "answers": answers,
        "investigator_oid": None, "investigator_match": None,
        "source_file": source_file,
    }
    return doc


def content_hash(doc):
    # Hash only the imported content; bookkeeping and linking fields are excluded
    # so that re-importing an unchanged response yields the same digest.
    payload = {k: doc[k] for k in doc if k not in
               ("content_sha256", "first_imported_at", "last_seen_at", "last_updated_at",
                "history", "investigator_oid", "investigator_match", "source_file")}
    return hashlib.sha256(json.dumps(payload, sort_keys=True, ensure_ascii=False,
                                     default=str).encode("utf-8")).hexdigest()


def load_investigators(db):
    inv = list(db.investigators.find(
        {"zeme": {"$in": ["Czech Republic", "Slovakia"]}},
        {"prijmeni": 1, "jmeno": 1, "email": 1, "email2": 1, "zeme": 1, "KROK": 1}))
    by_email, by_name = {}, {}
    for d in inv:
        for ef in ("email", "email2"):
            e = (d.get(ef) or "").lower().strip()
            if e:
                by_email.setdefault(e, d)
        nm = norm_name(d.get("prijmeni"))
        if nm:
            by_name.setdefault((nm, d.get("zeme")), []).append(d)
    return inv, by_email, by_name


def soft_link(doc, by_email, by_name):
    # Match order: PI e-mail, recipient e-mail, then a unique surname within the country.
    e = (doc.get("pi_email") or "").lower().strip()
    if e and e in by_email:
        d = by_email[e]
        return d["_id"], f"email:{e}", d
    e2 = (doc.get("recipient_email") or "").lower().strip()
    if e2 and e2 in by_email:
        d = by_email[e2]
        return d["_id"], f"recipient_email:{e2}", d
    nm = norm_name(doc.get("pi_last_name"))
    cand = by_name.get((nm, doc.get("site_country")), [])
    if len(cand) == 1:
        return cand[0]["_id"], f"prijmeni:{nm}", cand[0]
    if len(cand) > 1:
        return None, f"prijmeni_ambiguous:{nm}({len(cand)})", None
    return None, "NENALEZENO", None


def diff_docs(old, new):
    changes = []

    def walk(prefix, o, n):
        for k in sorted(set((o or {}).keys()) | set((n or {}).keys())):
            ov, nv = (o or {}).get(k), (n or {}).get(k)
            if isinstance(ov, dict) or isinstance(nv, dict):
                walk(f"{prefix}{k}.", ov or {}, nv or {})
            elif ov != nv:
                changes.append({"key": f"{prefix}{k}", "old": ov, "new": nv})

    for field in ("answers", "meta"):
        walk(f"{field}.", old.get(field, {}), new.get(field, {}))
    for k in ("site_name", "pi_email", "pi_last_name", "interested", "is_full_sipiq"):
        if old.get(k) != new.get(k):
            changes.append({"key": k, "old": old.get(k), "new": new.get(k)})
    return changes


# ---------------------------------------------------------------------------
def process_file(db, csv_path, scope, dry, by_email, by_name):
    source_file = os.path.basename(csv_path)
    # Load the CSV once; keep the unfiltered rows for the question dictionary.
    cols, data_all = load_csv(csv_path)
    get, _ = col_getter(cols, data_all)
    data = data_all
    if scope == "czsk":
        data = [r for r in data_all if get(r, "site_country") in ("Czech Republic", "Slovakia")]
    print(f"\n########## {source_file} (rozsah={scope}, odpovědí={len(data)}) ##########")

    # The dictionary is built from the full CSV, regardless of scope.
    questions = build_questions(cols, data_all)

    docs, link_rows = [], []
    for r in data:
        doc = build_response(cols, get, r, source_file)
        oid, how, matched = soft_link(doc, by_email, by_name)
        doc["investigator_oid"] = oid
        doc["investigator_match"] = how
        doc["content_sha256"] = content_hash(doc)
        docs.append(doc)
        link_rows.append((doc, how, matched))

    existing = {d["_id"]: d for d in db[COL_R].find({}, {"content_sha256": 1})}
    to_insert = [d for d in docs if d["_id"] not in existing]
    to_update = [d for d in docs if d["_id"] in existing
                 and existing[d["_id"]].get("content_sha256") != d["content_sha256"]]
    unchanged = [d for d in docs if d["_id"] in existing
                 and existing[d["_id"]].get("content_sha256") == d["content_sha256"]]

    mk7 = mko = un = 0
    for doc, how, m in link_rows:
        krok = (m or {}).get("KROK", "")
        if m and str(krok).startswith("7"):
            mk7 += 1
        elif m:
            mko += 1
        else:
            un += 1
    print(f"  slovník: {len(questions)} otázek | soft-link: KROK7={mk7}, jiný={mko}, nenapárováno={un}")
    print(f"  delta: INSERT={len(to_insert)}, UPDATE={len(to_update)}, beze změny={len(unchanged)}")
    if un:
        for doc, how, m in link_rows:
            if not m:
                print(f"    ✗ NENAPÁROVÁNO: {doc.get('pi_last_name')} / {doc.get('pi_email')} ({how})")

    if dry:
        print("  [DRY-RUN] nezapsáno")
        return {"insert": 0, "update": 0, "unchanged": 0, "wrote": False}

    for q in questions:
        db[COL_Q].replace_one({"_id": q["_id"]}, q, upsert=True)
    ts = now_iso()
    ni = nu = ns = 0
    for d in docs:
        cur = db[COL_R].find_one({"_id": d["_id"]})
        if cur is None:
            d.update({"first_imported_at": ts, "last_seen_at": ts, "last_updated_at": ts, "history": []})
            db[COL_R].insert_one(d)
            ni += 1
        elif cur.get("content_sha256") != d["content_sha256"]:
            # Content changed: overwrite the fields and append the diff to history.
            changes = diff_docs(cur, d)
            db[COL_R].update_one({"_id": d["_id"]}, {
                "$set": {**{k: d[k] for k in d if k != "_id"}, "last_seen_at": ts, "last_updated_at": ts},
                "$push": {"history": {"changed_at": ts, "source_file": source_file, "changes": changes}}})
            nu += 1
        else:
            db[COL_R].update_one({"_id": d["_id"]}, {"$set": {"last_seen_at": ts, "source_file": source_file}})
            ns += 1
    print(f"  [APPLY] questions upsert={len(questions)} | responses insert={ni}, update={nu}, beze změny={ns}")
    return {"insert": ni, "update": nu, "unchanged": ns, "wrote": True}


def move_to_processed(csv_path, folder):
    dest_dir = os.path.join(folder, PROCESSED_SUBDIR)
    os.makedirs(dest_dir, exist_ok=True)
    base = os.path.basename(csv_path)
    dest = os.path.join(dest_dir, base)
    if os.path.exists(dest):  # name collision -> append _N suffix
        stem, ext = os.path.splitext(base)
        n = 1
        while os.path.exists(os.path.join(dest_dir, f"{stem}_{n}{ext}")):
            n += 1
        dest = os.path.join(dest_dir, f"{stem}_{n}{ext}")
    shutil.move(csv_path, dest)
    print(f"  -> přesunuto do {PROCESSED_SUBDIR}\\{os.path.basename(dest)}")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--csv", help="jediný soubor (NEpřesouvá)")
    ap.add_argument("--folder", default=DEFAULT_FOLDER, help="složka se SIPIQ CSV (přesune do Zpracováno)")
    ap.add_argument("--scope", choices=["czsk", "all"], default="czsk")
    ap.add_argument("--apply", action="store_true")
    # --dry-run is accepted for readability only; dry-run is the default whenever --apply is absent.
    ap.add_argument("--dry-run", action="store_true")
    args = ap.parse_args()
    dry = not args.apply

    if args.csv:
        files, move_mode, folder = [args.csv], False, None
    else:
        folder = args.folder
        files = sorted(glob.glob(os.path.join(folder, "*.csv")))
        move_mode = True
        print(f"Složka: {folder}\nNalezeno CSV ke zpracování: {len(files)}")
        if not files:
            print("Nic ke zpracování (žádné *.csv).")
            return

    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
    db = client[DB_NAME]
    client.admin.command("ping")
    inv, by_email, by_name = load_investigators(db)
    print(f"Investigatorů CZ+SK v DB: {len(inv)}")

    total = {"insert": 0, "update": 0, "unchanged": 0}
    for f in files:
        res = process_file(db, f, args.scope, dry, by_email, by_name)
        for k in total:
            total[k] += res[k]
        # Files are moved only in folder mode and only after a real write.
        if move_mode and res["wrote"]:
            move_to_processed(f, folder)

    print(f"\n=== CELKEM: insert={total['insert']}, update={total['update']}, beze změny={total['unchanged']} ===")
    if dry:
        print("[DRY-RUN] Nic se nezapsalo ani nepřesunulo. Ostrý běh: --apply")
    client.close()


if __name__ == "__main__":
    main()
@@ -0,0 +1,32 @@
|
||||
# analyze_sent_suspects_v1.0.py

**Version:** 1.0 · **Date:** 2026-06-16

Local (Z230) analyzer for `.msg` files transferred from JNJ (the output of
`jnj_scan_failed_sent`). Using **olefile**, it walks every `.msg` in the folder,
extracts the key MAPI properties from each one, and classifies whether the e-mail
was **never sent**. Output: a console summary plus a timestamped `.xlsx`.

## Classification
- **FAIL_BODY**: the body or report text contains "could not be sent" / "SendAsDenied" / …
- **SENDAS_BUZ**: the send-account, SentRepresenting, or Sender property contains `buzalka.cz`
- **NO_MSGID**: the Internet Message-ID (0x1035) is missing
- `failed = ANO` ("yes") when FAIL_BODY or SENDAS_BUZ is set (the message was almost certainly not sent); otherwise `failed = ?`.
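
The rules above can be sketched as a small pure function. This is an illustration, not the script itself: the function name `classify` and the shortened `FAIL_SIGNS` tuple are assumptions made for the example, and the inputs are plain strings rather than MAPI properties read through olefile.

```python
# Shortened, illustrative list of failure phrases (the real script carries more).
FAIL_SIGNS = ("could not be sent", "sendasdenied")


def classify(body, send_account, sent_representing, sender, message_id):
    """Return (flags, failed) for one message, following the three rules."""
    flags = []
    # FAIL_BODY: any failure phrase appears in the body, case-insensitively.
    if any(sign in body.lower() for sign in FAIL_SIGNS):
        flags.append("FAIL_BODY")
    # SENDAS_BUZ: any sender-side property mentions the private domain.
    if "buzalka.cz" in " ".join([send_account, sent_representing, sender]).lower():
        flags.append("SENDAS_BUZ")
    # NO_MSGID: the Internet Message-ID is empty.
    if not message_id:
        flags.append("NO_MSGID")
    # NO_MSGID alone is only a weak hint, so it does not set failed to "ANO".
    failed = "ANO" if ("FAIL_BODY" in flags or "SENDAS_BUZ" in flags) else "?"
    return flags, failed


if __name__ == "__main__":
    print(classify("This message could not be sent.", "", "", "", ""))
```

Keeping the decision separate from file reading makes the rule easy to check against hand-written inputs before running it over a whole folder.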

The script also extracts the **physician recipient** (an external address, i.e. not `its.jnj.com`), the subject,
the send-account, and the Message-ID. The date is taken from the file name (`..._YYYY-MM-DD_...`).
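
A minimal sketch of those two extraction steps, assuming the recipient addresses have already been read into a list of strings. The helper names `pick_recipient` and `date_from_filename` and the sample file name are hypothetical.

```python
import re

INTERNAL_DOMAINS = ("its.jnj.com",)


def pick_recipient(recipients):
    """Prefer the first external address; fall back to the first address, else ''."""
    external = [r for r in recipients
                if not any(d in r.lower() for d in INTERNAL_DOMAINS)]
    if external:
        return external[0]
    return recipients[0] if recipients else ""


def date_from_filename(name):
    """Return the first YYYY-MM-DD found in the file name, or '' if none."""
    m = re.search(r"(\d{4}-\d{2}-\d{2})", name)
    return m.group(1) if m else ""


if __name__ == "__main__":
    print(pick_recipient(["cc@its.jnj.com", "doctor@hospital.example"]))
    print(date_from_filename("STRONG_2026-06-12_offer.msg"))
```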

## Running
```
python analyze_sent_suspects_v1.0.py [MSG_FOLDER]
```
- Without an argument it uses `INPUT_DIR` (default
  `U:\Dropbox\!!!Days\Downloads Z230\sent_suspects`).
- The `.xlsx` is saved to `U:\Dropbox\!!!Days\Downloads Z230\`.
- Requires `olefile` + `openpyxl` (both are in the venv `U:\janssen\.venv`).

## After the analysis (next step)
The list of recipients with `failed=ANO` = physicians who **never received the initial offer**.
A cross-reference against `feasibility.investigators` shows who needs the offer re-sent (and at which KROK step),
**with the correct From address `vbuzalka@its.jnj.com`**.
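
The cross-reference could look roughly like the sketch below. It works on plain dictionaries so it runs without a database. The investigator field names (`email`, `email2`, `prijmeni`, `KROK`) are the ones read by the importer in this commit, while the function name `resend_list` and the sample data are assumptions for illustration.

```python
def resend_list(rows, investigators):
    """Match failed recipients to investigators by e-mail (case-insensitive)."""
    by_email = {}
    for inv in investigators:
        for field in ("email", "email2"):
            e = (inv.get(field) or "").strip().lower()
            if e:
                by_email.setdefault(e, inv)
    out = []
    for r in rows:
        if r["failed"] != "ANO":
            continue  # only messages that almost certainly were not sent
        inv = by_email.get(r["recipient"].strip().lower())
        out.append({
            "recipient": r["recipient"],
            "prijmeni": inv.get("prijmeni") if inv else None,
            "KROK": inv.get("KROK") if inv else None,
        })
    return out


if __name__ == "__main__":
    rows = [{"recipient": "Dr@Example.org", "failed": "ANO"},
            {"recipient": "ok@example.org", "failed": "?"}]
    investigators = [{"prijmeni": "Novak", "email": "dr@example.org", "KROK": "7a"}]
    for item in resend_list(rows, investigators):
        print(item)
```

Recipients that match no investigator are kept with `None` fields so they can be reviewed by hand rather than silently dropped.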
|
||||
"""
|
||||
@@ -0,0 +1,196 @@
|
||||
# -*- coding: utf-8 -*-
# =============================================================================
# Name:     analyze_sent_suspects_v1.0.py
# Version:  1.0
# Date:     2026-06-16
# Description: LOCAL (Z230) analyzer of .msg files transferred from JNJ (output of
#           jnj_scan_failed_sent). Uses olefile to read the key MAPI properties
#           of each .msg and classifies whether the e-mail was NOT SENT:
#             FAIL_BODY  = body/report contains "could not be sent"/"SendAsDenied"
#             SENDAS_BUZ = send-account / sentrep / sender contains "buzalka.cz"
#             NO_MSGID   = Internet Message-ID (0x1035) is missing
#           Extracts the recipient (external = physician), subject, send-account, Message-ID.
#           Output: console summary + timestamped .xlsx.
# Usage:    python analyze_sent_suspects_v1.0.py [MSG_FOLDER]
#           (default INPUT_DIR below). Requires olefile + openpyxl.
# =============================================================================

import os
import re
import sys
import glob
import datetime
import olefile
import openpyxl

INPUT_DIR = r"U:\Dropbox\!!!Days\Downloads Z230\sent_suspects"
OUT_DIR = r"U:\Dropbox\!!!Days\Downloads Z230"

FAIL_SIGNS = [
    "could not be sent", "sendasdenied",
    "permission to send the message on behalf",
    "transportsend operation has failed", "mapiexceptionsendasdenied",
]
INTERNAL = ("its.jnj.com",)  # internal = non-physician (incl. cc Kocourkova/Bartosova)


def rd(o, tag):
    """Read the string stream __substg1.0_<tag> (tries 001F unicode and 001E ansi)."""
    for t in (tag, tag[:-1] + "F", tag[:-1] + "E"):
        name = "__substg1.0_" + t
        if o.exists(name):
            b = o.openstream(name).read()
            if t.endswith("001F"):
                try:
                    return b.decode("utf-16-le")
                except Exception:
                    pass
            # ANSI variant, or a unicode stream that failed to decode.
            for enc in ("cp1250", "latin-1", "utf-8"):
                try:
                    return b.decode(enc)
                except Exception:
                    pass
    return ""


def read_body(o):
    txt = rd(o, "1000001F")  # PR_BODY
    if not txt:
        txt = rd(o, "1001001F")  # ReportText
    # PR_HTML (binary) as a fallback
    if not txt and o.exists("__substg1.0_10130102"):
        try:
            txt = o.openstream("__substg1.0_10130102").read().decode("latin-1", "ignore")
        except Exception:
            pass
    return txt or ""


def recipients_smtp(o):
    """Collect the SMTP address of every recipient from the __recip_version1.0_#xxxx storages."""
    out = []
    seen = set()
    for entry in o.listdir():
        # entry is a list of path segments; we care about the first segment (the recip storage)
        if entry and entry[0].startswith("__recip_version1.0_#") and len(entry) == 2:
            top = entry[0]
            if top in seen:
                continue
            seen.add(top)
            smtp = ""
            for tag in ("39FE001F", "39FE001E", "3003001F", "3003001E", "0C1F001F"):
                nm = top + "/__substg1.0_" + tag
                if o.exists(nm):
                    b = o.openstream(nm).read()
                    try:
                        s = b.decode("utf-16-le") if tag.endswith("1F") else b.decode("cp1250")
                    except Exception:
                        s = b.decode("latin-1", "ignore")
                    s = s.strip()
                    if "@" in s:
                        smtp = s
                        break
            if smtp:
                out.append(smtp)
    return out


def analyze_file(path):
    o = olefile.OleFileIO(path)
    try:
        subject = rd(o, "0037001F")
        msgid = rd(o, "1035001F")
        sendacct = rd(o, "0E28001F")
        sentrep = rd(o, "0065001F")
        sender = rd(o, "0C1F001F")
        body = read_body(o)
        recs = recipients_smtp(o)
    finally:
        o.close()

    low = body.lower()
    flags = []
    if any(s in low for s in FAIL_SIGNS):
        flags.append("FAIL_BODY")
    joined = " ".join([sendacct, sentrep, sender]).lower()
    if "buzalka.cz" in joined:
        flags.append("SENDAS_BUZ")
    if not msgid:
        flags.append("NO_MSGID")

    # physician recipient = external (not its.jnj.com)
    ext = [r for r in recs if not any(d in r.lower() for d in INTERNAL)]
    recipient = ext[0] if ext else (recs[0] if recs else "")

    # date from the file name (STRONG_YYYY-MM-DD_... / weak_YYYY-MM-DD_...)
    m = re.search(r"(\d{4}-\d{2}-\d{2})", os.path.basename(path))
    date = m.group(1) if m else ""

    return {
        "file": os.path.basename(path),
        "date": date,
        "recipient": recipient,
        "subject": subject.strip(),
        "msgid": msgid.strip(),
        "send_account": sendacct.strip(),
        "sentrep": sentrep.strip(),
        "flags": "+".join(flags),
        "failed": "ANO" if ("FAIL_BODY" in flags or "SENDAS_BUZ" in flags) else "?",
    }


def main():
    indir = sys.argv[1] if len(sys.argv) > 1 else INPUT_DIR
    files = sorted(glob.glob(os.path.join(indir, "*.msg")))
    if not files:
        print("Zadne .msg v:", indir)
        return

    rows = []
    for f in files:
        try:
            rows.append(analyze_file(f))
        except Exception as e:
            rows.append({"file": os.path.basename(f), "date": "", "recipient": "",
                         "subject": "<chyba cteni>", "msgid": "", "send_account": "",
                         "sentrep": "", "flags": "ERR:" + str(e), "failed": "?"})

    # sort: certain failures first, then by date
    rows.sort(key=lambda r: (r["failed"] != "ANO", r["date"]))

    n_fail = sum(1 for r in rows if r["failed"] == "ANO")
    n_sendas = sum(1 for r in rows if "SENDAS_BUZ" in r["flags"])
    n_failbody = sum(1 for r in rows if "FAIL_BODY" in r["flags"])
    # Rows whose only signal is the missing Message-ID (weak evidence).
    # Counted directly; subtracting n_fail from the NO_MSGID total is wrong
    # because failed rows do not necessarily lack a Message-ID.
    n_nomid_only = sum(1 for r in rows if "NO_MSGID" in r["flags"] and r["failed"] != "ANO")

    print(f"Souboru: {len(rows)}")
    print(f"  jiste selhane (FAIL_BODY/SENDAS_BUZ): {n_fail}")
    print(f"    z toho SENDAS_BUZ (buzalka.cz): {n_sendas} | FAIL_BODY: {n_failbody}")
    print(f"  jen NO_MSGID (slabe): {n_nomid_only}")
    print("=" * 110)
    print(f"{'datum':10} {'prijemce':32} {'fail':4} {'flags':22} subjekt")
    print("-" * 110)
    for r in rows:
        print(f"{r['date']:10} {r['recipient'][:32]:32} {r['failed']:4} {r['flags']:22} {r['subject'][:40]}")

    # xlsx
    wb = openpyxl.Workbook()
    ws = wb.active
    ws.title = "suspects"
    cols = ["file", "date", "recipient", "subject", "msgid", "send_account", "sentrep", "flags", "failed"]
    from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

    def clean(v):
        # openpyxl rejects control characters, so strip them before writing.
        return ILLEGAL_CHARACTERS_RE.sub("", str(v)) if v is not None else ""

    ws.append(cols)
    for r in rows:
        ws.append([clean(r[c]) for c in cols])
    stamp = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    out = os.path.join(OUT_DIR, f"sent_suspects_analyza_{stamp}.xlsx")
    wb.save(out)
    print("\nXLSX:", out)


if __name__ == "__main__":
    main()
@@ -0,0 +1,51 @@
|
||||
# doplnujici_dotazy_v1.0: tracking follow-up queries to sites

**Version:** 1.0 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)

## Purpose
When a SIPIQ answer is missing and the questionnaire can NO LONGER be reopened, we ask the site separately.
The collection `feasibility.doplnujici_dotazy` records **which site and which question** a query
belongs to and what state it is in. It is related to `sipiq_responses` / `sipiq_questions` (see sipiq_import).

## Model (agreed 17JUN2026)
- **1 document = one query EVENT** (it can carry several questions in `questions[]`).
- When the site replies, the answer is **projected into `sipiq_responses.answers_supplement{}`**
  (`{value, source:"doplneno", doplnujici_dotaz_id, answered_at, answer_source}`); the original
  Qualtrics `answers` are **NOT CHANGED**. Analysis/reconstruction can then overlay `answers_supplement` on top of `answers`.
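
One way to do that overlay at read time is sketched below. It assumes that `answers_supplement` is keyed by the same sanitized Qcode as `answers` and that each entry carries a `value` field as listed above; the function name `effective_answers` is hypothetical.

```python
def effective_answers(response):
    """Return answers with supplements laid on top; the input is not modified."""
    merged = dict(response.get("answers") or {})
    for key, sup in (response.get("answers_supplement") or {}).items():
        # A supplement fills a gap or overrides the Qualtrics value for that key.
        merged[key] = sup["value"]
    return merged


if __name__ == "__main__":
    response = {
        "answers": {"Q25": "Yes"},
        "answers_supplement": {"Q72_1": {"value": "8", "source": "doplneno"}},
    }
    print(effective_answers(response))
```

Because the merge happens on a copy, the stored Qualtrics `answers` stay untouched, which matches the rule stated in the model.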

## Document structure
```jsonc
{
  "_id": ObjectId,
  "response_id": "R_…",                 // ref sipiq_responses._id
  "investigator_oid": ObjectId,          // ref investigators
  "pi_last_name","site_name","site_country","pi_email",   // denormalized identity
  "status": "open",                      // open → asked → answered → closed / no_response
  "asked_at": null, "asked_via": null, "reason": "…", "note": null,
  "questions": [
    {"qcode":"Q72_1","question_base":"Q72","question_text":"…","section":"…",
     "answer":null,"answered_at":null,"answer_source":null,"status":"open"}
  ],
  "created_at":"…","updated_at":"…","history":[]
}
```
Indexes: `investigator_oid`, `response_id`, `status`, `questions.qcode`, `questions.status`.

## Commands
```
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py ensure
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py add --center <email|prijmeni|R_id> [--country CZ|SK] \
    --qcodes Q72_1,Q73_1 [--reason "…"] [--asked-via "…"] [--status asked] [--note "…"] [--apply]
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py answer --id <dotaz_id> --qcode Q72_1 \
    --answer "8" [--source "email 18JUN2026"] [--apply]
.venv\Scripts\python.exe Feasibility\doplnujici_dotazy_v1.0.py list [--center …] [--open]
```
- `add`/`answer` default to a **dry run**; pass `--apply` for a live run.
- `add` looks up the site in `sipiq_responses` (R_id / pi_email / surname + country) and the question text + section
  in `sipiq_questions` (the qcode may be a leaf, e.g. Q72_1 → text of base Q72 + the item label).
- `answer` writes the answer to one question, recomputes the event status (answered only once all questions are answered),
  and projects the answer into `sipiq_responses.answers_supplement`.
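
The status rule for `answer` can be sketched on a plain dictionary, without MongoDB. This is an assumption-laden illustration rather than the script's code: the function name `apply_answer` is hypothetical, and the event is expected to have the shape shown in the document structure section.

```python
def apply_answer(event, qcode, answer, source, answered_at):
    """Record one answer and flip the event to 'answered' once nothing is open."""
    for q in event["questions"]:
        if q["qcode"] == qcode:
            q.update({"answer": answer, "answered_at": answered_at,
                      "answer_source": source, "status": "answered"})
            break
    else:
        # The loop finished without a match: the qcode is not part of this event.
        raise KeyError(f"{qcode} is not part of this event")
    # The event becomes 'answered' only when every question has been answered.
    if all(q["status"] == "answered" for q in event["questions"]):
        event["status"] = "answered"
    event["updated_at"] = answered_at
    return event


if __name__ == "__main__":
    event = {"status": "asked", "updated_at": None, "questions": [
        {"qcode": "Q72_1", "status": "open"},
        {"qcode": "Q73_1", "status": "open"},
    ]}
    apply_answer(event, "Q72_1", "8", "email 18JUN2026", "2026-06-18T10:00:00+02:00")
    print(event["status"])
    apply_answer(event, "Q73_1", "12", "email 18JUN2026", "2026-06-18T10:05:00+02:00")
    print(event["status"])
```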

## Status as of 17JUN2026
The collection and its indexes have been created (`ensure`); 0 documents so far. A dry-run `add` was verified (Svoboda, Q72_1+Q73_1).
Mongo 192.168.1.76:27017, no auth, pymongo.
@@ -0,0 +1,254 @@
|
||||
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
doplnujici_dotazy_v1.0.py
=========================
Version: 1.0
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)

Description
-----------
Manages the collection `feasibility.doplnujici_dotazy`: a record of follow-up queries to sites,
used when a SIPIQ answer is missing and the questionnaire can NO LONGER be reopened. It tells us
which site (and which question) a query belongs to and what state it is in.

Model (agreed 17JUN2026): **1 document = one query EVENT** (it can carry several questions in `questions[]`).
When the site replies, the answer is ALSO PROJECTED into `sipiq_responses.answers_supplement{}`
(flagged with source="doplneno"); the original Qualtrics `answers` are NOT CHANGED.

Query life cycle: open → asked → answered → closed / no_response.

Commands
--------
  ensure
      Creates the collection + indexes (idempotent).

  add --center <email|prijmeni|R_id> [--country CZ|SK] --qcodes Q72_1,Q73_1
      [--reason "…"] [--asked-via "…"] [--status asked] [--note "…"] [--apply]
      Creates a new query event. The site and the questions are looked up in sipiq_responses
      / sipiq_questions; the identity is denormalized. Dry-run by default.

  answer --id <dotaz_id> --qcode Q72_1 --answer "8" [--source "email 18JUN2026"] [--apply]
      Writes the answer to one question of the event, projects it into
      sipiq_responses.answers_supplement, and recomputes the event status. Dry-run by default.

  list [--center <email|prijmeni>] [--open]
      Lists queries (optionally only open ones / for a single site).

Mongo 192.168.1.76:27017, no auth, pymongo.
"""
import argparse
import re
import sys
from datetime import datetime, timezone

from pymongo import MongoClient, ASCENDING
from bson import ObjectId

MONGO_URI = "mongodb://192.168.1.76:27017"
DB = "feasibility"
COL = "doplnujici_dotazy"
COL_R = "sipiq_responses"
COL_Q = "sipiq_questions"

OPEN_STATES = ("open", "asked")


def now_iso():
    return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")


def qbase(qcode):
    m = re.match(r"(Q\d+)", qcode)
    return m.group(1) if m else qcode


def db_conn():
    c = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
    c.admin.command("ping")
    return c, c[DB]


def ensure(db):
    db[COL].create_index([("investigator_oid", ASCENDING)])
    db[COL].create_index([("response_id", ASCENDING)])
    db[COL].create_index([("status", ASCENDING)])
    db[COL].create_index([("questions.qcode", ASCENDING)])
    db[COL].create_index([("questions.status", ASCENDING)])
    print(f"OK: kolekce '{COL}' + indexy připraveny. Dokumentů: {db[COL].count_documents({})}")


def find_center(db, key, country=None):
    """Find a sipiq_responses document by ResponseId / pi_email / surname."""
    if key.startswith("R_"):
        d = db[COL_R].find_one({"_id": key})
        if d:
            return d
    d = db[COL_R].find_one({"pi_email": key.lower()})
    if d:
        return d
    # Fall back to a case-insensitive exact match on the surname.
    flt = {"pi_last_name": re.compile(f"^{re.escape(key)}$", re.I)}
    if country:
        flt["site_country"] = {"CZ": "Czech Republic", "SK": "Slovakia"}.get(country, country)
    cands = list(db[COL_R].find(flt))
    if len(cands) == 1:
        return cands[0]
    if len(cands) > 1:
        raise SystemExit(f"CHYBA: '{key}' je nejednoznačné ({len(cands)} center). Upřesni e-mailem nebo --country / R_id.")
    raise SystemExit(f"CHYBA: centrum '{key}' nenalezeno v {COL_R}.")


def question_meta(db, qcode):
    """Question text + section from sipiq_questions (qcode may be a leaf, e.g. Q72_1)."""
    base = qbase(qcode)
    q = db[COL_Q].find_one({"_id": base})
    if not q:
        return {"question_base": base, "question_text": None, "section": None}
    text = q.get("text")
    label = None
    for it in q.get("items", []):
        # Accept both the sanitized key (Q31_1_1) and the raw Qualtrics code (Q31#1_1).
        if qcode in (it.get("key"), it.get("qcode")):
            label = it.get("label")
            break
    full = f"{text} — {label}" if label else text
    return {"question_base": base, "question_text": full, "section": q.get("section")}


def cmd_add(db, args, dry):
|
||||
center = find_center(db, args.center, args.country)
|
||||
qcodes = [q.strip() for q in args.qcodes.split(",") if q.strip()]
|
||||
questions = []
|
||||
for qc in qcodes:
|
||||
meta = question_meta(db, qc)
|
||||
questions.append({
|
||||
"qcode": qc, "question_base": meta["question_base"],
|
||||
"question_text": meta["question_text"], "section": meta["section"],
|
||||
"answer": None, "answered_at": None, "answer_source": None, "status": "open",
|
||||
})
|
||||
ts = now_iso()
|
||||
doc = {
|
||||
"response_id": center["_id"],
|
||||
"investigator_oid": center.get("investigator_oid"),
|
||||
"pi_last_name": center.get("pi_last_name"),
|
||||
"site_name": center.get("site_name"),
|
||||
"site_country": center.get("site_country"),
|
||||
"pi_email": center.get("pi_email"),
|
||||
"status": args.status,
|
||||
"asked_at": ts if args.status == "asked" else None,
|
||||
"asked_via": args.asked_via,
|
||||
"reason": args.reason or "neodpovězeno v SIPIQ; dotazník už uzavřen",
|
||||
"note": args.note,
|
||||
"questions": questions,
|
||||
"created_at": ts, "updated_at": ts, "history": [],
|
||||
}
|
||||
print(f"Centrum: {doc['pi_last_name']} / {doc['site_name']} ({doc['site_country']}) resp={doc['response_id']}")
|
||||
for q in questions:
|
||||
print(f" • {q['qcode']:10} [{q['section']}] {q['question_text']}")
|
||||
if dry:
|
||||
print("[DRY-RUN] Nezaloženo. Ostrý: --apply")
|
||||
return
|
||||
res = db[COL].insert_one(doc)
|
||||
print(f"[APPLY] Založen dotaz _id={res.inserted_id}")
|
||||
|
||||
|
||||
def cmd_answer(db, args, dry):
|
||||
doc = db[COL].find_one({"_id": ObjectId(args.id)})
|
||||
if not doc:
|
||||
raise SystemExit(f"CHYBA: dotaz _id={args.id} nenalezen.")
|
||||
qs = doc["questions"]
|
||||
target = next((q for q in qs if q["qcode"] == args.qcode), None)
|
||||
if not target:
|
||||
raise SystemExit(f"CHYBA: otázka {args.qcode} není v tomto dotazu (má: {[q['qcode'] for q in qs]}).")
|
||||
ts = now_iso()
|
||||
print(f"Centrum: {doc['pi_last_name']} / {doc['site_name']} resp={doc['response_id']}")
|
||||
print(f" {args.qcode}: {target.get('answer')!r} -> {args.answer!r} (zdroj: {args.source})")
|
||||
print(f" + promítnutí do {COL_R}.answers_supplement.{args.qcode}")
|
||||
if dry:
|
||||
print("[DRY-RUN] Nezapsáno. Ostrý: --apply")
|
||||
return
|
||||
# 1) update otázky v události
|
||||
for q in qs:
|
||||
if q["qcode"] == args.qcode:
|
||||
q["answer"] = args.answer
|
||||
q["answered_at"] = ts
|
||||
q["answer_source"] = args.source
|
||||
q["status"] = "answered"
|
||||
all_answered = all(q["status"] == "answered" for q in qs)
|
||||
new_status = "answered" if all_answered else "asked"
|
||||
db[COL].update_one({"_id": doc["_id"]}, {
|
||||
"$set": {"questions": qs, "status": new_status, "updated_at": ts},
|
||||
"$push": {"history": {"changed_at": ts, "action": "answer",
|
||||
"qcode": args.qcode, "answer": args.answer, "source": args.source}},
|
||||
})
|
||||
# 2) promítnout do sipiq_responses.answers_supplement (původní answers NEMĚNÍM)
|
||||
db[COL_R].update_one({"_id": doc["response_id"]}, {
|
||||
"$set": {f"answers_supplement.{args.qcode}": {
|
||||
"value": args.answer, "source": "doplneno",
|
||||
"doplnujici_dotaz_id": doc["_id"], "answered_at": ts, "answer_source": args.source,
|
||||
}}
|
||||
})
|
||||
print(f"[APPLY] Odpověď zapsána; stav události = {new_status}; promítnuto do {COL_R}.")
|
||||
|
||||
|
||||
def cmd_list(db, args):
    flt = {}
    if args.open:
        flt["status"] = {"$in": list(OPEN_STATES)}
    if args.center:
        key = args.center
        if key.startswith("R_"):
            flt["response_id"] = key
        elif "@" in key:
            flt["pi_email"] = key.lower()
        else:
            flt["pi_last_name"] = re.compile(f"^{re.escape(key)}$", re.I)
    docs = list(db[COL].find(flt).sort("created_at", -1))
    print(f"Dotazů: {len(docs)}")
    for d in docs:
        print(f"\n[{d['_id']}] {d['pi_last_name']} / {d['site_name']} ({d.get('site_country')}) — {d['status']}")
        for q in d["questions"]:
            a = q.get("answer")
            # compare with None so that a stored but falsy answer is still displayed
            shown = f"= {a}" if a is not None else "(čeká)"
            print(f" {q['qcode']:10} {q['status']:9} {shown} | {q.get('question_text')}")
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser()
|
||||
sub = ap.add_subparsers(dest="cmd", required=True)
|
||||
sub.add_parser("ensure")
|
||||
pa = sub.add_parser("add")
|
||||
pa.add_argument("--center", required=True)
|
||||
pa.add_argument("--country")
|
||||
pa.add_argument("--qcodes", required=True)
|
||||
pa.add_argument("--reason")
|
||||
pa.add_argument("--asked-via", dest="asked_via")
|
||||
pa.add_argument("--status", default="open", choices=["open", "asked"])
|
||||
pa.add_argument("--note")
|
||||
pa.add_argument("--apply", action="store_true")
|
||||
pn = sub.add_parser("answer")
|
||||
pn.add_argument("--id", required=True)
|
||||
pn.add_argument("--qcode", required=True)
|
||||
pn.add_argument("--answer", required=True)
|
||||
pn.add_argument("--source")
|
||||
pn.add_argument("--apply", action="store_true")
|
||||
pl = sub.add_parser("list")
|
||||
pl.add_argument("--center")
|
||||
pl.add_argument("--open", action="store_true")
|
||||
args = ap.parse_args()
|
||||
|
||||
client, db = db_conn()
|
||||
try:
|
||||
if args.cmd == "ensure":
|
||||
ensure(db)
|
||||
elif args.cmd == "add":
|
||||
cmd_add(db, args, dry=not args.apply)
|
||||
elif args.cmd == "answer":
|
||||
cmd_answer(db, args, dry=not args.apply)
|
||||
elif args.cmd == "list":
|
||||
cmd_list(db, args)
|
||||
finally:
|
||||
client.close()
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,48 @@
|
||||
# jnj_dump_recipient_msgs_v1.0.py

**Version:** 1.0 · **Date:** 2026-06-16

JNJ-native (pywin32 / MAPI). Finds **all e-mails sent to a given recipient** (default
Hušták) across the selected folders, **saves them as `.msg`**, and for each one **prints
diagnostic MAPI properties read from the live item**. Purpose: to verify whether the
properties (GAL name, ReportText, send account, Message-ID…) survive in the
saved `.msg` (compared with olefile at home).

The script **sends and deletes nothing**; it only reads items and saves `.msg` copies.

## Running (JNJ machine with Outlook)
```
pip install pywin32
python jnj_dump_recipient_msgs_v1.0.py
```

## What it prints for each e-mail (from the LIVE item)
- folder, role (To/Cc), `item.Sent`, `PR_MESSAGE_FLAGS` (0x0E07)
- subject, sent time
- **Msg-ID** `0x1035`
- **SenderName** `0x0C1A` + addrtype `0x0C1E`
- **SentRepresentingName** `0x0042` + addrtype `0x0064`
- **PrimarySendAccount** `0x0E28` (reveals sending "as buzalka.cz")
- **ReportText** `0x1001` (NDR "could not be sent…" = failure)

…and then it saves the item as `.msg` into `OUTPUT_DIR`.

## Configuration
- `TARGET_EMAIL`: who to look for (default `rastislav.hustak@fntt.sk`).
- `SCAN_FOLDERS`: folder names (subfolders included); default Sent Items, Drafts,
  Deleted Items, Archive, Inbox. `SCAN_ALL=True` = the whole mailbox (slow).
- `OUTPUT_DIR`: where to save the `.msg` files (default `C:\Users\vbuzalka\hustak_dump`).
- `SENDER_SMTP`: the account whose store is searched.

## After running
1. Compare the printout (live properties): it shows which e-mail has the GAL name /
   ReportText / send account buzalka.cz.
2. Transfer the `.msg` files from `OUTPUT_DIR` home (any way you like, e.g. via the msgreceiver
   upload or by hand) and check with olefile whether the saved `.msg` carries
   the same properties as the live item.
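
The olefile check mentioned in step 2 could look roughly like the sketch below. It is not part of the commit. It assumes that a saved `.msg` keeps each string property in a top-level stream named `__substg1.0_<tag>`, stored either as PT_UNICODE (`001F`, UTF-16LE) or, in the ANSI variant of the format, as PT_STRING8 (`001E`); the `cp1252` code page used for the ANSI case is a guess. The names `PROP_IDS`, `stream_candidates`, `decode_stream` and `read_msg_props` are hypothetical.

```python
# Property ids taken from the list above (without the type suffix).
PROP_IDS = {
    "Msg-ID": "1035",
    "SenderName": "0C1A",
    "SentRepName": "0042",
    "ReportText": "1001",
    "PrimarySendAcct": "0E28",
}


def stream_candidates(prop_id):
    """Stream names to try for one property, with the codec each one implies."""
    return [
        (f"__substg1.0_{prop_id}001F", "utf-16-le"),  # PT_UNICODE
        (f"__substg1.0_{prop_id}001E", "cp1252"),     # PT_STRING8, code page assumed
    ]


def decode_stream(raw, codec):
    """Decode a property stream and drop the trailing NUL terminator, if any."""
    return raw.decode(codec, errors="replace").rstrip("\x00")


def read_msg_props(path):
    """Return {label: text or None} for every property in PROP_IDS."""
    import olefile  # lazy import: pip install olefile

    ole = olefile.OleFileIO(path)
    try:
        out = {}
        for label, prop_id in PROP_IDS.items():
            out[label] = None
            for name, codec in stream_candidates(prop_id):
                if ole.exists(name):
                    out[label] = decode_stream(ole.openstream(name).read(), codec)
                    break
        return out
    finally:
        ole.close()
```

Calling `read_msg_props` on a file from `OUTPUT_DIR` and comparing the result with the printout of the live item shows which properties were lost on save; a `None` value means the stream is absent in the file.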
|
||||
|
||||
## Notes
- Recipients are matched through `PR_SMTP_ADDRESS` (0x39FE), which also works reliably for internal
  Exchange recipients.
- `olMSG = 3` (SaveAs type). File name = index + folder + subject + tail of the
  EntryID (for pairing).
|
||||
@@ -0,0 +1,188 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# =============================================================================
# Name:        jnj_dump_recipient_msgs_v1.0.py
# Version:     1.0
# Date:        2026-06-16
# Description: JNJ-native (MAPI / pywin32). Finds ALL e-mails sent to a given
#              recipient (default Hustak) across the selected folders, SAVES them
#              as .msg and PRINTS diagnostic MAPI properties read from the LIVE
#              item (Message-ID 0x1035, SenderName 0x0C1A, SentRepresentingName
#              0x0042, addrtype 0x0C1E/0x0064, ReportText 0x1001, PrimarySendAccount
#              0x0E28, MessageFlags 0x0E07, item.Sent). Goal: check whether these
#              properties survive in the saved .msg (olefile check at home).
# Usage:       Run in the JNJ Python (Thonny), Outlook with the JNJ mailbox.
#              pip install pywin32 ; python jnj_dump_recipient_msgs_v1.0.py
#              The script sends and deletes NOTHING; it only READS and saves .msg copies.
# =============================================================================
|
||||
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import win32com.client # pywin32
|
||||
|
||||
# ----------------------------- KONFIGURACE -----------------------------------
|
||||
|
||||
SENDER_SMTP = "vbuzalka@its.jnj.com" # ucet (jeho store se prohledava)
|
||||
TARGET_EMAIL = "rastislav.hustak@fntt.sk" # koho hledame (To NEBO Cc)
|
||||
|
||||
# Slozky k prohledani (shoda na NAZEV slozky kdekoli ve strome; vc. podslozek).
|
||||
# Prazdny seznam + SCAN_ALL=True => projde celou schranku (pomale!).
|
||||
SCAN_FOLDERS = ["Sent Items", "Drafts", "Deleted Items", "Archive", "Inbox"]
|
||||
SCAN_ALL = False
|
||||
|
||||
# Kam ulozit .msg kopie (na JNJ stroji). Vytvori se, kdyz neexistuje.
|
||||
OUTPUT_DIR = r"C:\Users\vbuzalka\hustak_dump"
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
|
||||
OL_MSG = 3 # olMSG (SaveAs typ)
|
||||
OL_FOLDER_SENT = 5
|
||||
PA = "http://schemas.microsoft.com/mapi/proptag/0x{:s}"
|
||||
|
||||
# Diagnosticke tagy (PT_UNICODE 001F, dlouhe 0003)
|
||||
TAGS = [
|
||||
("Msg-ID", "1035001F"),
|
||||
("SenderName", "0C1A001F"),
|
||||
("SenderAddrType", "0C1E001F"),
|
||||
("SentRepName", "0042001F"),
|
||||
("SentRepAddrType", "0064001F"),
|
||||
("ReportText", "1001001F"),
|
||||
("PrimarySendAcct", "0E28001F"),
|
||||
]
|
||||
TAG_MSGFLAGS = "0E070003"
|
||||
TAG_RCPT_ADDRTYPE = "3002001F"
|
||||
|
||||
|
||||
def smtp_of(recipient):
|
||||
try:
|
||||
return (recipient.PropertyAccessor.GetProperty(PA.format("39FE001E")) or "").lower()
|
||||
except Exception:
|
||||
try:
|
||||
return (recipient.Address or "").lower()
|
||||
except Exception:
|
||||
return ""
|
||||
|
||||
|
||||
def get_prop(item, tag):
|
||||
try:
|
||||
v = item.PropertyAccessor.GetProperty(PA.format(tag))
|
||||
return v
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def get_store_root(ns):
|
||||
try:
|
||||
for acct in ns.Accounts:
|
||||
if (acct.SmtpAddress or "").lower() == SENDER_SMTP.lower():
|
||||
return acct.DeliveryStore.GetRootFolder()
|
||||
except Exception:
|
||||
pass
|
||||
return ns.GetDefaultFolder(OL_FOLDER_SENT).Parent # fallback: koren default store
|
||||
|
||||
|
||||
def iter_target_folders(root):
|
||||
"""Yield slozek, ktere se maji skenovat (dle nazvu + jejich podslozky)."""
|
||||
def walk(folder, inscope):
|
||||
scope = inscope or SCAN_ALL or (folder.Name in SCAN_FOLDERS)
|
||||
if scope:
|
||||
yield folder
|
||||
try:
|
||||
for sub in folder.Folders:
|
||||
yield from walk(sub, scope)
|
||||
except Exception:
|
||||
pass
|
||||
yield from walk(root, False)
|
||||
|
||||
|
||||
def safe(s, n=40):
|
||||
s = re.sub(r"[^A-Za-z0-9._-]+", "_", (s or ""))
|
||||
return s[:n].strip("_")
|
||||
|
||||
|
||||
def matches_target(item):
    """Return 'To', 'Cc' or 'Bcc' when TARGET_EMAIL is among the recipients, else None."""
    tgt = TARGET_EMAIL.lower()
    try:
        for r in item.Recipients:
            if smtp_of(r) == tgt:
                # Recipient.Type: 1 = olTo, 2 = olCC, 3 = olBCC
                return {1: "To", 2: "Cc", 3: "Bcc"}.get(r.Type, "To")
    except Exception:
        pass
    return None
|
||||
|
||||
|
||||
def main():
|
||||
os.makedirs(OUTPUT_DIR, exist_ok=True)
|
||||
outlook = win32com.client.Dispatch("Outlook.Application")
|
||||
ns = outlook.GetNamespace("MAPI")
|
||||
root = get_store_root(ns)
|
||||
|
||||
print(f"Hledam e-maily, kde je prijemce: {TARGET_EMAIL}")
|
||||
print(f"Slozky: {'VSE' if SCAN_ALL else ', '.join(SCAN_FOLDERS)}")
|
||||
print(f"Vystup .msg: {OUTPUT_DIR}")
|
||||
print("=" * 90)
|
||||
|
||||
idx = 0
|
||||
for folder in iter_target_folders(root):
|
||||
try:
|
||||
items = folder.Items
|
||||
except Exception:
|
||||
continue
|
||||
for it in list(items):
|
||||
try:
|
||||
if it.Class != 43: # olMail
|
||||
continue
|
||||
except Exception:
|
||||
continue
|
||||
role = matches_target(it)
|
||||
if not role:
|
||||
continue
|
||||
idx += 1
|
||||
|
||||
# --- diagnostika ze ZIVE polozky ---
|
||||
try:
|
||||
sent_flag = it.Sent
|
||||
except Exception:
|
||||
sent_flag = "?"
|
||||
flags = get_prop(it, TAG_MSGFLAGS)
|
||||
props = {label: get_prop(it, tag) for label, tag in TAGS}
|
||||
try:
|
||||
sent_on = it.SentOn
|
||||
except Exception:
|
||||
sent_on = None
|
||||
try:
|
||||
entry_tail = (it.EntryID or "")[-20:]
|
||||
except Exception:
|
||||
entry_tail = ""
|
||||
|
||||
print(f"\n[{idx}] slozka='{folder.Name}' role={role} Sent={sent_flag} flags={flags}")
|
||||
print(f" subject : {getattr(it,'Subject','')}")
|
||||
print(f" sent_on : {sent_on}")
|
||||
print(f" Msg-ID : {props['Msg-ID']}")
|
||||
print(f" SenderName : {props['SenderName']} (addrtype {props['SenderAddrType']})")
|
||||
print(f" SentRepName : {props['SentRepName']} (addrtype {props['SentRepAddrType']})")
|
||||
print(f" PrimarySendAcct: {props['PrimarySendAcct']}")
|
||||
rt = props["ReportText"]
|
||||
print(f" ReportText 0x1001: {'ANO -> ' + repr(rt[:120]) if rt else '-'}")
|
||||
|
||||
# --- ulozeni .msg ---
|
||||
fn = f"{idx:02d}_{safe(folder.Name,18)}_{safe(getattr(it,'Subject',''),28)}_{entry_tail}.msg"
|
||||
path = os.path.join(OUTPUT_DIR, fn)
|
||||
try:
|
||||
it.SaveAs(path, OL_MSG)
|
||||
print(f" ulozeno: {fn}")
|
||||
except Exception as e:
|
||||
print(f" !! SaveAs chyba: {e}")
|
||||
|
||||
print("\n" + "=" * 90)
|
||||
print(f"Hotovo. Nalezeno a ulozeno: {idx} polozek do {OUTPUT_DIR}")
|
||||
print("Prines .msg domu a porovnej vlastnosti olefilem (zive vs ulozene).")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
main()
|
||||
except Exception as e:
|
||||
print("CHYBA:", e)
|
||||
sys.exit(1)
|
||||
@@ -0,0 +1,45 @@
|
||||
# jnj_scan_failed_sent_v1.0.py

**Version:** 1.0 · **Date:** 2026-06-16

JNJ-native (pywin32 / MAPI). Scans **Sent Items for the last N days** (default 60),
finds **suspicious = probably unsent** e-mails, saves them as `.msg`,
and prints which flags matched. **Sends and deletes nothing.**

## Flags (read from the LIVE item)
- **FAIL_BODY** (strong): the body / ReportText contains "could not be sent",
  "SendAsDenied", "permission to send the message on behalf",
  "TransportSend operation has failed", "MapiExceptionSendAsDenied".
- **SENDAS_BUZ** (strong): `PrimarySendAccount` (0x0E28) / SentRepresenting (0x0065)
  / Sender (0x0C1F) contains `buzalka.cz`, i.e. the message was sent under the wrong identity.
- **NO_MSGID** (weak): the Internet Message-ID (0x1035) is missing; this can also be a
  provisional copy that gets completed later.

`STRONG_*` files = strong flag (almost certainly not sent).
`weak_*` files = NO_MSGID only.
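
The flag rules above can be tried outside Outlook on plain strings. The sketch below is only an illustration and does not replace the script's own logic: `classify` and its parameters are hypothetical names, the inputs stand for values that would otherwise be read from the live item, and the phrase list is limited to the strings quoted in this README.

```python
FAIL_SIGNS = (
    "could not be sent",
    "sendasdenied",
    "permission to send the message on behalf",
    "transportsend operation has failed",
    "mapiexceptionsendasdenied",
)


def classify(body, report_text, sender_fields, message_id, bad_domain="buzalka.cz"):
    """Return (flags, file_prefix); file_prefix is 'STRONG', 'weak' or None."""
    flags = []
    # FAIL_BODY: case-insensitive search in body + ReportText
    blob = f"{body or ''}\n{report_text or ''}".lower()
    if any(sign in blob for sign in FAIL_SIGNS):
        flags.append("FAIL_BODY")
    # SENDAS_BUZ: any sender-side field mentions the wrong domain
    if any(bad_domain in (value or "").lower() for value in sender_fields):
        flags.append("SENDAS_BUZ")
    # NO_MSGID: the Internet Message-ID is missing or empty
    if not message_id:
        flags.append("NO_MSGID")
    if "FAIL_BODY" in flags or "SENDAS_BUZ" in flags:
        return flags, "STRONG"
    return flags, ("weak" if flags else None)
```

Keeping the rules in a function that takes plain values makes it possible to replay them over properties extracted from saved `.msg` files as well.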
|
||||
|
||||
## Running (JNJ machine with Outlook)
```
pip install pywin32
python jnj_scan_failed_sent_v1.0.py
```

## Configuration
- `DAYS` = window (default 60).
- `OUTPUT_DIR` = where to save the `.msg` files (default `C:\Users\vbuzalka\sent_suspects`).
- `INCLUDE_NO_MSGID` = also save items flagged only NO_MSGID (default True; set it to False
  if you want only the hard FAIL/SENDAS hits).
- `SENDER_SMTP` = the account whose Sent Items are scanned.

## Procedure
1. Run it on the JNJ machine; the output lists the suspects and the saved `.msg` files.
2. Bring the `.msg` files from `OUTPUT_DIR` home; we go through them with olefile and confirm
   which ones really were not sent (and who needs a resend with the correct From).

## Notes
- The 60-day window is there for performance (items are sorted by SentOn descending, so the scan stops at the first older item).
- Detection works on the **live** item (fresh SaveAs), which is why the script runs
  directly on the JNJ machine and not over old batch copies.
- Main cause of failure: From = `vladimir.buzalka@buzalka.cz` on the account
  `vbuzalka@its.jnj.com` without SendAs permission, so Exchange rejects the message. See the memory note
  project_jnj_unsent_detection.
|
||||
@@ -0,0 +1,191 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# =============================================================================
# Name:        jnj_scan_failed_sent_v1.0.py
# Version:     1.0
# Date:        2026-06-16
# Description: JNJ-native (MAPI / pywin32). Scans the Sent Items folder for the
#              last N days and finds SUSPICIOUS e-mails = probably NOT SENT
#              (e.g. SendAs denied). Saves each suspect as .msg and prints which
#              flags matched. Sends and deletes NOTHING; it only READS and saves.
#              Suspicion flags (read from the LIVE item):
#                FAIL_BODY  = body/ReportText contains "could not be sent" / "SendAsDenied"
#                             / "permission to send the message on behalf" / "TransportSend"
#                SENDAS_BUZ = PrimarySendAccount/SentRepresenting/Sender contains "buzalka.cz"
#                NO_MSGID   = Internet Message-ID (0x1035) is missing -- weaker flag
# Usage:       JNJ Python (Thonny), Outlook with the JNJ mailbox.
#              pip install pywin32 ; python jnj_scan_failed_sent_v1.0.py
# =============================================================================
|
||||
|
||||
import os
|
||||
import re
|
||||
import sys
|
||||
import datetime
|
||||
import win32com.client # pywin32
|
||||
|
||||
# ----------------------------- KONFIGURACE -----------------------------------
|
||||
|
||||
SENDER_SMTP = "vbuzalka@its.jnj.com"
|
||||
DAYS = 60 # okno: poslednich N dni
|
||||
OUTPUT_DIR = r"C:\Users\vbuzalka\sent_suspects"
|
||||
|
||||
# Ukladat i polozky, ktere maji JEN slaby priznak NO_MSGID (bez FAIL/SENDAS)?
|
||||
# True = vc. provizornich kopii bez Message-ID (muze byt vic souboru).
|
||||
INCLUDE_NO_MSGID = True
|
||||
|
||||
# -----------------------------------------------------------------------------
|
||||
|
||||
OL_MSG = 3
|
||||
OL_FOLDER_SENT = 5
|
||||
PA = "http://schemas.microsoft.com/mapi/proptag/0x{:s}"
|
||||
|
||||
P_MSGID = "1035001F"
|
||||
P_SENDACCT = "0E28001F" # PrimarySendAccount
|
||||
P_SENTREP_EM = "0065001F" # SentRepresentingEmailAddress
|
||||
P_SENDER_EM = "0C1F001F" # SenderEmailAddress
|
||||
P_REPORTTEXT = "1001001F" # ReportText (kdyz existuje)
|
||||
|
||||
FAIL_SIGNS = [
|
||||
"could not be sent",
|
||||
"sendasdenied",
|
||||
"permission to send the message on behalf",
|
||||
"transportsend operation has failed",
|
||||
"mapiexceptionsendasdenied",
|
||||
"tuto zpravu nelze odeslat", # pro pripad lokalizace
|
||||
]
|
||||
|
||||
|
||||
def gp(item, tag):
|
||||
try:
|
||||
return item.PropertyAccessor.GetProperty(PA.format(tag))
|
||||
except Exception:
|
||||
return None
|
||||
|
||||
|
||||
def get_sent_folder(ns):
|
||||
try:
|
||||
for acct in ns.Accounts:
|
||||
if (acct.SmtpAddress or "").lower() == SENDER_SMTP.lower():
|
||||
return acct.DeliveryStore.GetDefaultFolder(OL_FOLDER_SENT)
|
||||
except Exception:
|
||||
pass
|
||||
return ns.GetDefaultFolder(OL_FOLDER_SENT)
|
||||
|
||||
|
||||
def safe(s, n=34):
|
||||
return re.sub(r"[^A-Za-z0-9._-]+", "_", (s or ""))[:n].strip("_")
|
||||
|
||||
|
||||
def analyze(item):
|
||||
"""Vrati seznam priznaku (flags) pro polozku."""
|
||||
flags = []
|
||||
|
||||
# 1) FAIL_BODY: telo + ReportText
|
||||
blob = ""
|
||||
try:
|
||||
blob += (item.Body or "")
|
||||
except Exception:
|
||||
pass
|
||||
rt = gp(item, P_REPORTTEXT)
|
||||
if rt:
|
||||
blob += "\n" + str(rt)
|
||||
low = blob.lower()
|
||||
if any(s in low for s in FAIL_SIGNS):
|
||||
flags.append("FAIL_BODY")
|
||||
|
||||
# 2) SENDAS_BUZ: nektera z odesilatelskych poloz. obsahuje buzalka.cz
|
||||
for tag in (P_SENDACCT, P_SENTREP_EM, P_SENDER_EM):
|
||||
v = gp(item, tag)
|
||||
if v and "buzalka.cz" in str(v).lower():
|
||||
flags.append("SENDAS_BUZ")
|
||||
break
|
||||
|
||||
# 3) NO_MSGID
|
||||
mid = gp(item, P_MSGID)
|
||||
if not mid:
|
||||
flags.append("NO_MSGID")
|
||||
|
||||
return flags, (mid or "")
|
||||
|
||||
|
||||
def main():
|
||||
os.makedirs(OUTPUT_DIR, exist_ok=True)
|
||||
cutoff = datetime.date.today() - datetime.timedelta(days=DAYS)
|
||||
|
||||
outlook = win32com.client.Dispatch("Outlook.Application")
|
||||
ns = outlook.GetNamespace("MAPI")
|
||||
sent = get_sent_folder(ns)
|
||||
items = sent.Items
|
||||
items.Sort("[SentOn]", True) # nejnovejsi prvni
|
||||
|
||||
print(f"Slozka : {sent.FolderPath}")
|
||||
print(f"Okno : poslednich {DAYS} dni (od {cutoff.isoformat()})")
|
||||
print(f"Vystup : {OUTPUT_DIR}")
|
||||
print(f"NO_MSGID se uklada: {INCLUDE_NO_MSGID}")
|
||||
print("=" * 90)
|
||||
|
||||
scanned = saved = strong = 0
|
||||
for it in list(items):
|
||||
try:
|
||||
if it.Class != 43:
|
||||
continue
|
||||
except Exception:
|
||||
continue
|
||||
# datum + early stop
|
||||
try:
|
||||
s = it.SentOn
|
||||
sdate = datetime.date(s.year, s.month, s.day)
|
||||
except Exception:
|
||||
sdate = None
|
||||
if sdate is not None:
|
||||
if sdate < cutoff:
|
||||
break # dale uz jen starsi (serazeno desc)
|
||||
scanned += 1
|
||||
|
||||
flags, mid = analyze(it)
|
||||
if not flags:
|
||||
continue
|
||||
is_strong = ("FAIL_BODY" in flags) or ("SENDAS_BUZ" in flags)
|
||||
if not is_strong and not (INCLUDE_NO_MSGID and "NO_MSGID" in flags):
|
||||
continue
|
||||
|
||||
saved += 1
|
||||
if is_strong:
|
||||
strong += 1
|
||||
|
||||
subj = ""
|
||||
try:
|
||||
subj = it.Subject or ""
|
||||
except Exception:
|
||||
pass
|
||||
try:
|
||||
tail = (it.EntryID or "")[-20:]
|
||||
except Exception:
|
||||
tail = ""
|
||||
|
||||
tagstr = "+".join(flags)
|
||||
print(f"\n[{saved}] {sdate} flags={tagstr}")
|
||||
print(f" subj : {subj}")
|
||||
print(f" msgid: {mid if mid else '<chybi>'}")
|
||||
|
||||
fn = f"{('STRONG' if is_strong else 'weak')}_{sdate}_{safe(subj,30)}_{tail}.msg"
|
||||
path = os.path.join(OUTPUT_DIR, fn)
|
||||
try:
|
||||
it.SaveAs(path, OL_MSG)
|
||||
print(f" ulozeno: {fn}")
|
||||
except Exception as e:
|
||||
print(f" !! SaveAs chyba: {e}")
|
||||
|
||||
print("\n" + "=" * 90)
|
||||
print(f"Prohledano (v okne): {scanned}")
|
||||
print(f"Ulozeno podezrelych: {saved} (z toho silnych FAIL/SENDAS: {strong})")
|
||||
print(f"Soubory v: {OUTPUT_DIR} -> prines je domu ke kontrole.")
|
||||
print("Pozn.: STRONG_* = telo NDR nebo send-account buzalka.cz (skoro jiste neodeslano).")
|
||||
print(" weak_* = jen chybi Message-ID (muze byt i provizorni kopie, co se pozdeji dokonci).")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
try:
|
||||
main()
|
||||
except Exception as e:
|
||||
print("CHYBA:", e)
|
||||
sys.exit(1)
|
||||
@@ -0,0 +1,63 @@
|
||||
# -*- coding: utf-8 -*-
|
||||
# =============================================================================
# Name:        promote_sipiq_submitted_v1.0.py
# Version:     1.0
# Date:        2026-06-17
# Description: Moves the listed investigators (KROK 6 - SIPIQ sent) to
#              KROK "7 - SIPIQ vyplneny" (SIPIQ completed) based on the Illuminator
#              export (status "SIPIQ Submitted"). Illuminator is the authoritative
#              source, because the physician does not have to announce the SIPIQ
#              completion by e-mail. Prepends a line to STATUS.
# Usage:       python promote_sipiq_submitted_v1.0.py            (dry-run)
#              python promote_sipiq_submitted_v1.0.py --apply
# =============================================================================
|
||||
import sys
|
||||
from pymongo import MongoClient
|
||||
from bson import ObjectId
|
||||
|
||||
MONGO_URI = "mongodb://192.168.1.76:27017"
|
||||
LINE = ("17JUN2026: SIPIQ VYPLNENY — dle Illuminator exportu (status „SIPIQ "
|
||||
"Submitted“); lekar vyplneni neoznamil, Illuminator = ultimatni zdroj. KROK 7.")
|
||||
|
||||
# 13 investigatoru se SIPIQ Submitted v Illuminatoru, v Mongo zatim KROK 6
|
||||
IDS = [
|
||||
("6a19832b5fc2213518257969", "Durina Juraj"),
|
||||
("6a19832b5fc221351825796e", "Falc Matej"),
|
||||
("6a19832b5fc2213518257954", "Fedurco Miroslav"),
|
||||
("6a19832b5fc221351825796c", "Gregar Jan"),
|
||||
("6a19832b5fc221351825794f", "Hlavaty Tibor"),
|
||||
("6a19832b5fc2213518257973", "Horvath Frantisek"),
|
||||
("6a19832b5fc221351825796f", "Konecny Michal"),
|
||||
("6a19832b5fc2213518257972", "Konecny Stefan"),
|
||||
("6a1c4275aa46d8b608065cec", "Lukac Ludovit"),
|
||||
("6a19832b5fc2213518257958", "Mihalkanin Lubomir"),
|
||||
("6a198b661218c31ab0f5ba41", "Pesta Martin"),
|
||||
("6a19832b5fc221351825795e", "Stepek David"),
|
||||
("6a198b661218c31ab0f5ba43", "Tichy Michal"),
|
||||
]
|
||||
|
||||
|
||||
def main():
    apply = "--apply" in sys.argv
    col = MongoClient(MONGO_URI)["feasibility"]["investigators"]
    n = 0
    for hid, label in IDS:
        oid = ObjectId(hid)
        d = col.find_one({"_id": oid}, {"STATUS": 1, "KROK": 1})
        if not d:
            print(f" !! {label}: NENALEZEN")
            continue
        krok = d.get("KROK") or ""
        if not krok.startswith("6"):
            print(f" ~~ {label}: KROK={krok} (neni 6) -> preskakuji")
            continue
        print(f" [{label}] KROK {krok} -> 7 - SIPIQ vyplneny")
        n += 1  # count every eligible investigator, in dry-run as well
        if apply:
            # prepend the new line so the newest entry stays at the top of STATUS
            new_status = LINE + "\n" + (d.get("STATUS", "") or "")
            col.update_one({"_id": oid}, {"$set": {
                "KROK": "7 - SIPIQ vyplneny", "STATUS": new_status}})
    # report eligible/total; skipped or missing records are no longer counted in dry-run
    print(f"\n{'ZAPSANO' if apply else 'DRY-RUN'}: {n}/{len(IDS)}")
    if not apply:
        print(">>> Pro zapis spust s --apply")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,40 @@
|
||||
# sipiq_import_v1.2: import of SIPIQ responses (folder workflow + provenance)

**Version:** 1.2 · **Date:** 2026-06-17 · **Study:** 77242113UCO3002 (ICONIC / DAWN)

## Changes
- **v1.2:** every response gets `source_exported_at` = **report date/time according to the filesystem**
  (mtime of the CSV file). It is kept outside the content hash, so it causes no needless UPDATE; it is also
  backfilled on the "unchanged" path. v1.1 moved to `Feasibility\TRASH`.
- **v1.1:** FOLDER workflow (`--folder`): collects *.csv, runs the delta import, moves the files to `Zpracováno`.

## Collections
- `sipiq_questions`: questionnaire dictionary (reconstruction of the SIPIQ as in the PDF).
- `sipiq_responses`: 1 doc = 1 response (`_id`=ResponseId), flat `answers{}`,
  soft link `investigator_oid`, `source_file` + `source_exported_at`, delta + `history[]`.

Source = Qualtrics **CSV** (row 1 Qcode, row 2 question text, row 3 ImportId=QID). Export labels,
decimal point, recode unanswered turned off.
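
A minimal sketch of reading the three header rows, assuming the layout described above (row 1 Qcode, row 2 question text, row 3 a JSON object with `ImportId`). The function name `read_qualtrics` and the shape of its return value are illustrative and are not taken from the importer.

```python
import csv
import io
import json


def read_qualtrics(fh):
    """Split a Qualtrics CSV into column metadata and data rows keyed by Qcode."""
    reader = csv.reader(fh)
    qcodes = next(reader)   # row 1: Q2, Q6_4, ...
    texts = next(reader)    # row 2: question text
    raw_ids = next(reader)  # row 3: {"ImportId": "QID..."}
    columns = []
    for qcode, text, raw in zip(qcodes, texts, raw_ids):
        try:
            qid = json.loads(raw).get("ImportId")
        except (ValueError, AttributeError):
            qid = None      # cell is not a JSON object
        columns.append({"qcode": qcode, "text": text, "qid": qid})
    rows = [dict(zip(qcodes, row)) for row in reader]
    return columns, rows


if __name__ == "__main__":
    demo = io.StringIO()
    writer = csv.writer(demo)
    writer.writerow(["ResponseId", "Q2"])
    writer.writerow(["Response ID", "Interested?"])
    writer.writerow(['{"ImportId":"_recordId"}', '{"ImportId":"QID2"}'])
    writer.writerow(["R_abc", "Yes"])
    demo.seek(0)
    print(read_qualtrics(demo))
```

With a real export the file would be opened with `newline=""` and an explicit encoding (for example `utf-8-sig`, if the export carries a BOM) before being passed to the function.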
|
||||
|
||||
## Delta (rewrites ONLY changed data)
new → INSERT; unchanged (same `content_sha256`) → only `last_seen_at` + `source_file` + `source_exported_at`;
changed → `$set` of the changed fields only + `$push` into `history[]`.
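
The three delta branches can be sketched as a pure function over dictionaries. This is a hedged illustration, not the importer's code: the set of fields excluded from the hash (`VOLATILE`) and the names `content_sha256` and `plan_delta` are assumptions; the README only states that `source_exported_at` stays outside the content hash.

```python
import hashlib
import json

# Bookkeeping fields assumed to be excluded from the content hash.
VOLATILE = {"last_seen_at", "source_file", "source_exported_at", "content_sha256", "history"}


def content_sha256(doc):
    """Hash of the document content, independent of key order and bookkeeping fields."""
    payload = {k: v for k, v in doc.items() if k not in VOLATILE}
    canon = json.dumps(payload, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def plan_delta(stored, incoming):
    """Return (action, changed) where action is 'insert', 'unchanged' or 'update'."""
    if stored is None:
        return "insert", {}
    if content_sha256(stored) == content_sha256(incoming):
        return "unchanged", {}
    changed = {}
    for key in sorted((set(stored) | set(incoming)) - VOLATILE):
        if stored.get(key) != incoming.get(key):
            changed[key] = {"old": stored.get(key), "new": incoming.get(key)}
    return "update", changed
```

For an `update`, the `new` values would go into `$set` and the whole `changed` mapping, with a timestamp, into the `$push` on `history[]`.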
|
||||
|
||||
## Soft link to investigators (non-destructive)
pi_email → email/email2 (lowercased), then recipient_email, fallback surname (without diacritics) + country.
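
The matching order can be illustrated on in-memory records. In this sketch the investigator fields `last_name` and `country`, the helper names, and the rule that the surname fallback links only on a single hit are assumptions; `email` / `email2` and the response fields follow the line above.

```python
import unicodedata


def strip_accents(s):
    """Remove diacritics: decompose to NFKD and drop the combining marks."""
    if not s:
        return ""
    return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))


def norm(s):
    return strip_accents(s).strip().lower()


def link_investigator(resp, investigators):
    """Return (investigator _id, matched_by) or (None, None)."""
    # 1) pi_email, then recipient_email, against email / email2
    for field in ("pi_email", "recipient_email"):
        mail = norm(resp.get(field))
        if not mail:
            continue
        for inv in investigators:
            if mail in (norm(inv.get("email")), norm(inv.get("email2"))):
                return inv["_id"], field
    # 2) fallback: surname without diacritics + country, only when unambiguous
    last, country = norm(resp.get("pi_last_name")), norm(resp.get("site_country"))
    if last:
        hits = [inv for inv in investigators
                if norm(inv.get("last_name")) == last and norm(inv.get("country")) == country]
        if len(hits) == 1:
            return hits[0]["_id"], "surname+country"
    return None, None
```

The function only returns an id, which matches the non-destructive intent: the caller stores it as `investigator_oid` on the response and leaves the investigator record untouched.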
|
||||
|
||||
## Usage
```
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --dry-run     # folder mode, default folder
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --apply
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --folder "<cesta>" --apply
.venv\Scripts\python.exe Feasibility\sipiq_import_v1.2.py --csv "<cesta.csv>" --apply   # single file, NOT moved
```
Default folder `…\77242113UCO2001\ImportSIPIQcompled`; files are moved to `Zpracováno` only with `--apply` in folder mode.
`--scope czsk` (default) / `all`. Default = dry-run.

## Workflow
The user drops complete SIPIQ reports (Qualtrics CSV, named
`ICONIC+Phase+3b+UC+Study+(77242113UCO3002)_SipIQ_V1_13MAY2026_<datum>_<čas>.csv`) into
`ImportSIPIQcompled\`. After `--apply` they are imported (delta) and moved to `Zpracováno\`.
`source_exported_at` is taken from the file mtime (report date/time according to the filesystem).
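
A rough sketch of the folder workflow under the stated rules: every `*.csv` is processed, and files are moved only in apply mode. `process_folder` and the `import_one` callback are hypothetical stand-ins for the importer's real functions; `file_mtime_iso` mirrors the helper of the same name in the script.

```python
import glob
import os
import shutil
from datetime import datetime


def file_mtime_iso(path):
    """File modification time as a local ISO timestamp with UTC offset."""
    return datetime.fromtimestamp(os.path.getmtime(path)).astimezone().isoformat(timespec="seconds")


def process_folder(folder, import_one, apply=False, processed_subdir="Zpracováno"):
    """Import every *.csv in `folder`; move it to the subfolder only when apply=True."""
    handled = []
    for path in sorted(glob.glob(os.path.join(folder, "*.csv"))):
        # read the mtime before the move so provenance refers to the original file
        exported_at = file_mtime_iso(path)
        import_one(path, exported_at, apply)
        if apply:
            target_dir = os.path.join(folder, processed_subdir)
            os.makedirs(target_dir, exist_ok=True)
            shutil.move(path, os.path.join(target_dir, os.path.basename(path)))
        handled.append(os.path.basename(path))
    return handled
```

Because the glob is not recursive, files already sitting in the processed subfolder are not picked up again on the next run.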
|
||||
@@ -0,0 +1,489 @@
|
||||
#!/usr/bin/env python3
|
||||
# -*- coding: utf-8 -*-
|
||||
"""
|
||||
sipiq_import_v1.2.py
====================
Version: 1.2
Date:    2026-06-17
Author:  Claude Code (for MUDr. Vladimír Buzalka)

Changes since v1.1
------------------
- PROVENANCE: every response stores `source_exported_at` = report date/time
  according to the FILESYSTEM (mtime of the CSV file). It is kept outside the content
  hash, so it causes no needless UPDATE; it is also backfilled on the "unchanged" path.
  The old v1.1 is kept in TRASH.

Changes since v1.0
------------------
- FOLDER WORKFLOW (v1.1): the --folder mode collects *.csv in a folder, imports them (delta)
  and moves them to the `Zpracováno` subfolder. Default folder =
  U:\\PythonProject\\Janssen\\Feasibility\\77242113UCO2001\\ImportSIPIQcompled.

Description
-----------
Import of SIPIQ responses (Qualtrics CSV export, study 77242113UCO3002 / ICONIC DAWN)
into the MongoDB database `feasibility`. Two collections:
  * sipiq_questions: questionnaire dictionary (1 doc = 1 logical question).
  * sipiq_responses: 1 doc = 1 response (_id = Qualtrics ResponseId), flat answers{},
                     soft link investigator_oid, delta bookkeeping + history[].

DELTA import (rewrites ONLY changed data): new -> insert; unchanged -> only last_seen_at
(plus source_file / source_exported_at); changed -> $set of the changed fields only +
push into history[].

Usage
-----
    python sipiq_import_v1.2.py --dry-run                  # folder mode, default folder
    python sipiq_import_v1.2.py --apply
    python sipiq_import_v1.2.py --folder "<cesta>" --apply
    python sipiq_import_v1.2.py --csv "<cesta.csv>" --apply  # single file (NOT moved)

Dependencies: pymongo (.venv). Mongo 192.168.1.76:27017, no auth.
|
||||
"""
|
||||
import argparse
|
||||
import csv
|
||||
import glob
|
||||
import hashlib
|
||||
import json
|
||||
import os
|
||||
import re
|
||||
import shutil
|
||||
import sys
|
||||
import unicodedata
|
||||
from datetime import datetime, timezone
|
||||
|
||||
try:
|
||||
from pymongo import MongoClient
|
||||
except ImportError:
|
||||
print("CHYBA: pymongo není nainstalován v aktuálním pythonu.", file=sys.stderr)
|
||||
raise
|
||||
|
||||
MONGO_URI = "mongodb://192.168.1.76:27017"
DB_NAME = "feasibility"
COL_Q = "sipiq_questions"
COL_R = "sipiq_responses"
DEFAULT_FOLDER = r"U:\PythonProject\Janssen\Feasibility\77242113UCO2001\ImportSIPIQcompled"
PROCESSED_SUBDIR = "Zpracováno"

META_COLS = {
    "StartDate", "EndDate", "Status", "IPAddress", "Progress", "Duration (in seconds)",
    "Finished", "RecordedDate", "ResponseId", "RecipientLastName", "RecipientFirstName",
    "RecipientEmail", "ExternalReference", "LocationLatitude", "LocationLongitude",
    "DistributionChannel", "UserLanguage",
}

PROMOTE = [
    "site_name", "site_address", "site_city", "site_state", "site_postcode", "site_country",
    "pi_first_name", "pi_last_name", "pi_phone", "pi_email",
    "sdl_site_id", "fire_site_id", "fire_investigator_id", "mailinglist_id",
    "survey_generated_by", "Date", "Time",
]

SECTION_BY_QNUM = {}


def _sec(rng, name):
    """Register section `name` for every question number in `rng`."""
    for n in rng:
        SECTION_BY_QNUM[n] = name


_sec([2], "J&J Internal Assessment")
_sec([6, 7, 8, 9, 10, 11, 12, 13], "Contact Information")
_sec(range(14, 22), "Confidentiality Statement")
_sec([25, 26, 27], "Interest")
_sec([29, 30, 31, 32, 33, 34], "Protocol Requirements")
_sec([36, 37, 38], "Enrollment")
_sec([40, 41, 42, 43], "Patient Demographics Overview")
_sec([45, 46, 47, 48, 49], "Site Overview")
_sec([51], "Operational Considerations")
_sec([53, 54], "Comments")
_sec([57, 58, 59, 60, 61], "Patient Population")
_sec([63, 64, 65, 66, 67], "Site Experience and Staffing")
_sec([69], "Equipment and Facility Requirements")
_sec([71, 72, 73, 74, 75], "Institutional Review Board, Ethics Committee, and Contracts")

# Full question stems for matrix questions whose CSV header text Qualtrics truncates.
STEM_OVERRIDE = {
    "Q31": "At your site, at what line(s) of treatment do you most commonly prescribe "
           "vedolizumab for patients with moderately to severely active ulcerative colitis?",
    "Q63": "Do you or your site staff have experience in performing the following types of "
           "study assessments/procedures?",
    "Q64": "The following personnel are required to run the study. "
           "Will your site have the following available?",
    "Q69": "The following equipment and facilities are required to run the studies. "
           "Are these available at your site?",
}


def now_iso():
    """Current local time as an ISO 8601 string with UTC offset."""
    return datetime.now(timezone.utc).astimezone().isoformat(timespec="seconds")


def file_mtime_iso(path):
    """File modification time as a local ISO 8601 string with UTC offset."""
    return datetime.fromtimestamp(os.path.getmtime(path)).astimezone().isoformat(timespec="seconds")


def strip_accents(s):
    """Remove diacritics (e.g. "Závada" -> "Zavada")."""
    if not s:
        return ""
    return "".join(c for c in unicodedata.normalize("NFKD", s) if not unicodedata.combining(c))


def norm_name(s):
    """Lower-case, accent-free, whitespace-collapsed name used for matching."""
    return re.sub(r"\s+", " ", strip_accents(s or "").lower()).strip()


def sanitize_key(qcode):
    """Make a Qcode safe as a MongoDB field name ("#" and "." become "_")."""
    return qcode.replace("#", "_").replace(".", "_")


def qnum(qcode):
    """Question number of a Qcode (e.g. "Q31#1_1" -> 31), or None."""
    m = re.match(r"Q(\d+)", qcode)
    return int(m.group(1)) if m else None


def qbase(qcode):
    """Base Qcode (e.g. "Q31#1_1" -> "Q31"); the input itself if it does not match."""
    m = re.match(r"(Q\d+)", qcode)
    return m.group(1) if m else qcode


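def import_id(h3_cell):
    """Extract the QID from a third-header-row cell such as {"ImportId":"QID63_1"}."""
    try:
        return json.loads(h3_cell).get("ImportId", "")
    except (ValueError, TypeError, AttributeError):
        # Not a JSON object: return the raw cell unchanged.
        return h3_cell

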
def split_text(text):
    """Split a Qualtrics header text "stem - label" into (stem, label or None)."""
    parts = [p.strip() for p in re.split(r"\s+-\s+", text)]
    stem = parts[0]
    if len(parts) == 1:
        return stem, None
    # Drop Qualtrics boilerplate ("Selected Choice") and bare matrix codes such as "Q31#1".
    label_parts = [p for p in parts[1:] if p.lower() != "selected choice"]
    label_parts = [p for p in label_parts if not re.fullmatch(r"Q\d+#\d+", p)]
    return stem, (" - ".join(label_parts) if label_parts else None)


def detect_type(qcode, observed):
    """Guess the question type from the Qcode and the values observed in its column."""
    has_hash = "#" in qcode
    vals = [v for v in observed if v]
    # bool(vals) keeps both flags strict booleans; an empty column is neither yes/no nor numeric.
    yesno = bool(vals) and all(v in ("Yes", "No") for v in vals)
    numeric = bool(vals) and all(re.fullmatch(r"-?\d+(\.\d+)?", v) for v in vals)
    if has_hash and yesno:
        return "matrix_yesno"
    if has_hash and numeric:
        return "matrix_percent"
    if has_hash:
        return "matrix"
    if numeric:
        return "numeric"
    if yesno:
        return "yesno"
    return "single_or_text"


def load_csv(path):
    """Read a Qualtrics CSV export: three header rows (Qcode, text, ImportId) plus data."""
    with open(path, encoding="utf-8-sig", newline="") as fh:
        rows = list(csv.reader(fh))
    if len(rows) < 3:
        raise ValueError(f"{path}: expected 3 Qualtrics header rows, found {len(rows)}")
    h1, h2, h3 = rows[0], rows[1], rows[2]
    data = rows[3:]
    cols = [{"i": i, "code": c, "text": t, "qid": import_id(j)}
            for i, (c, t, j) in enumerate(zip(h1, h2, h3))]
    return cols, data


def col_getter(cols, data):
    """Return (get, idx): get(row, code) reads a stripped cell by column code."""
    idx = {c["code"]: c["i"] for c in cols}

    def get(row, code):
        i = idx.get(code)
        return (row[i].strip() if i is not None and i < len(row) else "")

    return get, idx


def is_question_col(code):
    """True for question columns (Q2, Q6_4, Q31#1_1, ...)."""
    return bool(re.match(r"Q\d", code))


def build_questions(cols, data):
    """Build the question dictionary: one document per base Qcode (e.g. Q63)."""
    qcols = [c for c in cols if is_question_col(c["code"])]
    # Collect the distinct non-empty values observed in each question column.
    observed = {c["code"]: set() for c in qcols}
    for row in data:
        for c in qcols:
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                observed[c["code"]].add(v)
    # Group columns by base Qcode, preserving first-seen order.
    groups, order_seen = {}, []
    for c in qcols:
        base = qbase(c["code"])
        if base not in groups:
            groups[base] = {"_id": base, "order": c["i"], "qnum": qnum(c["code"]),
                            "section": SECTION_BY_QNUM.get(qnum(c["code"]), "Other"),
                            "qids": [], "text": split_text(c["text"])[0],
                            "items": [], "_obs": set(), "_types": []}
            order_seen.append(base)
        g = groups[base]
        bq = re.match(r"(QID\d+)", c["qid"] or "")
        if bq and bq.group(1) not in g["qids"]:
            g["qids"].append(bq.group(1))
        _, label = split_text(c["text"])
        item = {"key": sanitize_key(c["code"]), "qcode": c["code"], "qid": c["qid"]}
        if label:
            item["label"] = label
        g["items"].append(item)
        g["_obs"] |= observed[c["code"]]
        g["_types"].append(detect_type(c["code"], observed[c["code"]]))
    out = []
    for n, base in enumerate(order_seen):
        g = groups[base]
        obs = sorted(g.pop("_obs"))
        types = g.pop("_types")
        # Most frequent type wins; dict.fromkeys keeps first-seen order, so ties are
        # resolved the same way on every run (a plain set would not guarantee that).
        gtype = max(dict.fromkeys(types), key=types.count) if types else "single_or_text"
        g["type"] = gtype
        if gtype in ("yesno", "matrix_yesno"):
            g["options"] = ["Yes", "No"]
        elif gtype == "single_or_text" and obs and len(obs) <= 12:
            g["options"] = obs
        else:
            g["options"] = []
        if base in STEM_OVERRIDE:
            g["text"] = STEM_OVERRIDE[base]
        g["order"] = n
        # A single unlabeled column is a plain question, not a multi-part one.
        if len(g["items"]) == 1 and "label" not in g["items"][0]:
            g["items"] = []
        out.append(g)
    return out


def build_response(cols, get, row, source_file):
    """Build one sipiq_responses document from a CSV data row."""
    rid = get(row, "ResponseId")
    # Flat answers{} keyed by sanitized Qcode; empty cells are skipped.
    answers = {}
    for c in cols:
        if is_question_col(c["code"]):
            v = (row[c["i"]].strip() if c["i"] < len(row) else "")
            if v:
                answers[sanitize_key(c["code"])] = v
    progress = get(row, "Progress")
    duration = get(row, "Duration (in seconds)")
    meta = {
        "start_date": get(row, "StartDate") or None,
        "end_date": get(row, "EndDate") or None,
        "recorded_date": get(row, "RecordedDate") or None,
        "status": get(row, "Status") or None,
        "progress": int(progress) if progress.isdigit() else (progress or None),
        "finished": get(row, "Finished") in ("True", "1", "TRUE"),
        "duration_sec": int(duration) if duration.isdigit() else None,
        "user_language": get(row, "UserLanguage") or None,
        "distribution_channel": get(row, "DistributionChannel") or None,
        "ip_address": get(row, "IPAddress") or None,
        "location_lat": get(row, "LocationLatitude") or None,
        "location_lng": get(row, "LocationLongitude") or None,
        "survey_date": get(row, "Date") or None,
        "survey_time": get(row, "Time") or None,
    }
    doc = {
        "_id": rid, "study": "77242113UCO3002",
        "site_country": get(row, "site_country") or None,
        "site_name": get(row, "site_name") or None,
        "site_city": get(row, "site_city") or None,
        "site_state": get(row, "site_state") or None,
        "site_postcode": get(row, "site_postcode") or None,
        "site_address": get(row, "site_address") or None,
        "pi_first_name": get(row, "pi_first_name") or None,
        "pi_last_name": get(row, "pi_last_name") or None,
        "pi_email": (get(row, "pi_email") or "").lower() or None,
        "pi_phone": get(row, "pi_phone") or None,
        "sdl_site_id": get(row, "sdl_site_id") or None,
        "fire_site_id": get(row, "fire_site_id") or None,
        "fire_investigator_id": get(row, "fire_investigator_id") or None,
        "mailinglist_id": get(row, "mailinglist_id") or None,
        "survey_generated_by": get(row, "survey_generated_by") or None,
        "recipient_email": (get(row, "RecipientEmail") or "").lower() or None,
        "recipient_last_name": get(row, "RecipientLastName") or None,
        "recipient_first_name": get(row, "RecipientFirstName") or None,
        "meta": meta,
        # A response counts as a full SIPIQ once any of these later questions is answered.
        "is_full_sipiq": any(k.startswith(("Q57", "Q58", "Q59", "Q63", "Q66", "Q71")) for k in answers),
        "interested": answers.get("Q25"),
        "answers": answers,
        "investigator_oid": None, "investigator_match": None,
        "source_file": source_file,
    }
    return doc


def content_hash(doc):
    """SHA-256 of the response content, ignoring bookkeeping and linking fields."""
    payload = {k: doc[k] for k in doc if k not in
               ("content_sha256", "first_imported_at", "last_seen_at", "last_updated_at",
                "history", "investigator_oid", "investigator_match", "source_file",
                "source_exported_at")}
    return hashlib.sha256(json.dumps(payload, sort_keys=True, ensure_ascii=False,
                                     default=str).encode("utf-8")).hexdigest()


def load_investigators(db):
    """Load CZ+SK investigators and index them by e-mail and by (surname, country)."""
    inv = list(db.investigators.find(
        {"zeme": {"$in": ["Czech Republic", "Slovakia"]}},
        {"prijmeni": 1, "jmeno": 1, "email": 1, "email2": 1, "zeme": 1, "KROK": 1}))
    by_email, by_name = {}, {}
    for d in inv:
        for ef in ("email", "email2"):
            e = (d.get(ef) or "").lower().strip()
            if e:
                by_email.setdefault(e, d)
        nm = norm_name(d.get("prijmeni"))
        if nm:
            by_name.setdefault((nm, d.get("zeme")), []).append(d)
    return inv, by_email, by_name


def soft_link(doc, by_email, by_name):
    """Match a response to an investigator: PI e-mail, recipient e-mail, then surname+country.

    Returns (investigator _id or None, description of the match, matched document or None).
    """
    e = (doc.get("pi_email") or "").lower().strip()
    if e and e in by_email:
        d = by_email[e]
        return d["_id"], f"email:{e}", d
    e2 = (doc.get("recipient_email") or "").lower().strip()
    if e2 and e2 in by_email:
        d = by_email[e2]
        return d["_id"], f"recipient_email:{e2}", d
    nm = norm_name(doc.get("pi_last_name"))
    cand = by_name.get((nm, doc.get("site_country")), [])
    if len(cand) == 1:
        return cand[0]["_id"], f"prijmeni:{nm}", cand[0]
    if len(cand) > 1:
        # Several investigators share the surname in that country: do not guess.
        return None, f"prijmeni_ambiguous:{nm}({len(cand)})", None
    return None, "NENALEZENO", None


def diff_docs(old, new):
    """List field-level changes between the stored and the re-imported document."""
    changes = []

    def walk(prefix, o, n):
        for k in sorted(set((o or {}).keys()) | set((n or {}).keys())):
            ov, nv = (o or {}).get(k), (n or {}).get(k)
            if isinstance(ov, dict) or isinstance(nv, dict):
                walk(f"{prefix}{k}.", ov or {}, nv or {})
            elif ov != nv:
                changes.append({"key": f"{prefix}{k}", "old": ov, "new": nv})

    for field in ("answers", "meta"):
        walk(f"{field}.", old.get(field, {}), new.get(field, {}))
    # Only these top-level fields are tracked in history[].
    for k in ("site_name", "pi_email", "pi_last_name", "interested", "is_full_sipiq"):
        if old.get(k) != new.get(k):
            changes.append({"key": k, "old": old.get(k), "new": new.get(k)})
    return changes


# ---------------------------------------------------------------------------
def process_file(db, csv_path, scope, dry, by_email, by_name):
    """Import one CSV: upsert the question dictionary and delta-import the responses."""
    source_file = os.path.basename(csv_path)
    exported_at = file_mtime_iso(csv_path)  # report date/time taken from the filesystem (mtime)
    # Read the file once; the dictionary uses ALL rows, the scope filter applies to responses only.
    cols, data_all = load_csv(csv_path)
    get, _ = col_getter(cols, data_all)
    data = data_all
    if scope == "czsk":
        data = [r for r in data_all if get(r, "site_country") in ("Czech Republic", "Slovakia")]
    print(f"\n########## {source_file} (rozsah={scope}, odpovědí={len(data)}, export={exported_at}) ##########")

    questions = build_questions(cols, data_all)

    docs, link_rows = [], []
    for r in data:
        doc = build_response(cols, get, r, source_file)
        oid, how, matched = soft_link(doc, by_email, by_name)
        doc["investigator_oid"] = oid
        doc["investigator_match"] = how
        doc["source_exported_at"] = exported_at
        doc["content_sha256"] = content_hash(doc)
        docs.append(doc)
        link_rows.append((doc, how, matched))

    # Preview of the delta, based on the stored content hashes.
    existing = {d["_id"]: d for d in db[COL_R].find({}, {"content_sha256": 1})}
    to_insert = [d for d in docs if d["_id"] not in existing]
    to_update = [d for d in docs if d["_id"] in existing and existing[d["_id"]].get("content_sha256") != d["content_sha256"]]
    unchanged = [d for d in docs if d["_id"] in existing and existing[d["_id"]].get("content_sha256") == d["content_sha256"]]

    # Soft-link statistics: matched at KROK 7, matched at another KROK, unmatched.
    mk7 = mko = un = 0
    for doc, how, m in link_rows:
        krok = (m or {}).get("KROK", "")
        if m and str(krok).startswith("7"):
            mk7 += 1
        elif m:
            mko += 1
        else:
            un += 1
    print(f" slovník: {len(questions)} otázek | soft-link: KROK7={mk7}, jiný={mko}, nenapárováno={un}")
    print(f" delta: INSERT={len(to_insert)}, UPDATE={len(to_update)}, beze změny={len(unchanged)}")
    if un:
        for doc, how, m in link_rows:
            if not m:
                print(f" ✗ NENAPÁROVÁNO: {doc.get('pi_last_name')} / {doc.get('pi_email')} ({how})")

    if dry:
        print(" [DRY-RUN] nezapsáno")
        return {"insert": 0, "update": 0, "unchanged": 0, "wrote": False}

    for q in questions:
        db[COL_Q].replace_one({"_id": q["_id"]}, q, upsert=True)
    ts = now_iso()
    ni = nu = ns = 0
    for d in docs:
        cur = db[COL_R].find_one({"_id": d["_id"]})
        if cur is None:
            # New response.
            d.update({"first_imported_at": ts, "last_seen_at": ts, "last_updated_at": ts, "history": []})
            db[COL_R].insert_one(d)
            ni += 1
        elif cur.get("content_sha256") != d["content_sha256"]:
            # Changed response: rewrite the fields and record the diff in history[].
            changes = diff_docs(cur, d)
            db[COL_R].update_one({"_id": d["_id"]}, {
                "$set": {**{k: d[k] for k in d if k != "_id"}, "last_seen_at": ts, "last_updated_at": ts},
                "$push": {"history": {"changed_at": ts, "source_file": source_file, "changes": changes}}})
            nu += 1
        else:
            # Unchanged response: only refresh the bookkeeping fields.
            db[COL_R].update_one({"_id": d["_id"]}, {"$set": {
                "last_seen_at": ts, "source_file": source_file, "source_exported_at": d["source_exported_at"]}})
            ns += 1
    print(f" [APPLY] questions upsert={len(questions)} | responses insert={ni}, update={nu}, beze změny={ns}")
    return {"insert": ni, "update": nu, "unchanged": ns, "wrote": True}


def move_to_processed(csv_path, folder):
    """Move an imported CSV into the processed subfolder, adding _N on name clashes."""
    dest_dir = os.path.join(folder, PROCESSED_SUBDIR)
    os.makedirs(dest_dir, exist_ok=True)
    base = os.path.basename(csv_path)
    dest = os.path.join(dest_dir, base)
    if os.path.exists(dest):
        stem, ext = os.path.splitext(base)
        n = 1
        while os.path.exists(os.path.join(dest_dir, f"{stem}_{n}{ext}")):
            n += 1
        dest = os.path.join(dest_dir, f"{stem}_{n}{ext}")
    shutil.move(csv_path, dest)
    print(f" -> přesunuto do {PROCESSED_SUBDIR}\\{os.path.basename(dest)}")


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--csv", help="jediný soubor (NEpřesouvá)")
    ap.add_argument("--folder", default=DEFAULT_FOLDER, help="složka se SIPIQ CSV (přesune do Zpracováno)")
    ap.add_argument("--scope", choices=["czsk", "all"], default="czsk")
    ap.add_argument("--apply", action="store_true")
    ap.add_argument("--dry-run", action="store_true")
    args = ap.parse_args()
    # Dry-run is the default; only --apply writes. The --dry-run flag is accepted but not needed.
    dry = not args.apply

    if args.csv:
        files, move_mode, folder = [args.csv], False, None
    else:
        folder = args.folder
        files = sorted(glob.glob(os.path.join(folder, "*.csv")))
        move_mode = True
        print(f"Složka: {folder}\nNalezeno CSV ke zpracování: {len(files)}")
    if not files:
        print("Nic ke zpracování (žádné *.csv).")
        return

    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=8000)
    db = client[DB_NAME]
    client.admin.command("ping")
    inv, by_email, by_name = load_investigators(db)
    print(f"Investigatorů CZ+SK v DB: {len(inv)}")

    total = {"insert": 0, "update": 0, "unchanged": 0}
    for f in files:
        res = process_file(db, f, args.scope, dry, by_email, by_name)
        for k in total:
            total[k] += res[k]
        # Files are moved only in folder mode and only after a real write.
        if move_mode and res["wrote"]:
            move_to_processed(f, folder)

    print(f"\n=== CELKEM: insert={total['insert']}, update={total['update']}, beze změny={total['unchanged']} ===")
    if dry:
        print("[DRY-RUN] Nic se nezapsalo ani nepřesunulo. Ostrý běh: --apply")
    client.close()


if __name__ == "__main__":
    main()
@@ -0,0 +1,38 @@
# store_cda_seaweed_v1.0.py

**Version:** 1.0 · **Date:** 2026-06-17

## Purpose
Stores signed CDAs (PDF) from the assistants' (CTA) e-mails in Mongo
`feasibility.investigators`, in the `cda.*` field, and advances the physician to
`KROK "5 - CDA podepsano"`.

Unlike `store_cda_batch` (which downloads the `.msg` over SFTP from Tower and extracts the
attachment with `extract_msg`), this version downloads the PDF **directly from SeaweedFS** via
the `seaweed_url` that the parser stores with each attachment in `emaily."vbuzalka@its.jnj.com"`
(`attachments[].seaweed_url` + `sha256`). Simpler, with no SFTP.

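As a rough illustration of where the download references come from, the sketch below picks the PDF attachments out of one parsed e-mail document. Only `attachments[].seaweed_url` and `sha256` are named in this README; the `filename` field, the helper name `pick_pdf_attachments` and the sample document are assumptions made for the example, not the parser's confirmed schema.

```python
def pick_pdf_attachments(email_doc):
    """Return PDF attachments that carry both a SeaweedFS URL and a SHA-256."""
    picked = []
    for att in email_doc.get("attachments", []):
        name = att.get("filename", "")  # assumed field name, not confirmed by the source
        has_ref = bool(att.get("seaweed_url")) and bool(att.get("sha256"))
        if has_ref and name.lower().endswith(".pdf"):
            picked.append({
                "seaweed_url": att["seaweed_url"],
                "sha256": att["sha256"],
                "filename": name,
            })
    return picked


# Hypothetical e-mail document, for illustration only.
sample_email = {
    "attachments": [
        {"filename": "CDA_signed.pdf", "seaweed_url": "http://example.invalid/a", "sha256": "ab" * 32},
        {"filename": "logo.png", "seaweed_url": "http://example.invalid/b", "sha256": "cd" * 32},
        {"filename": "draft.pdf"},  # no SeaweedFS reference, so it is skipped
    ]
}

for att in pick_pdf_attachments(sample_email):
    print(att["filename"], att["seaweed_url"])
```

The result still has to be paired with an investigator by hand, which is what the explicit `MAPPING` in the script does.
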
## How it works
- `MAPPING` = explicit pairing `investigator _id → (seaweed_url, filename, sha256, size, source_msg_id)`.
- For each record: downloads the PDF (urllib), verifies **SHA256 + size + PDF header**,
  base64-encodes it and stores it under `cda`:
  `data_base64, data_sha256, data_filename, data_mime, data_size, data_stored_at,
  data_source_msg` + metadata `stav="podepsano", soubor, zdroj`.
- Sets `KROK = "5 - CDA podepsano"` and prepends a line to `STATUS`.
- `_id` is converted to `ObjectId` (plain pymongo does not convert string→ObjectId on its own).

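The verification and encoding steps can be sketched in isolation, without Mongo or SeaweedFS. The helper names `verify_pdf` and `cda_fields` are hypothetical and exist only for this example; the three checks and the `data_*` field names follow the list above.

```python
import base64
import hashlib


def verify_pdf(raw, expected_sha256, expected_size):
    """Run the three integrity checks on downloaded bytes."""
    return {
        "sha_ok": hashlib.sha256(raw).hexdigest() == expected_sha256,
        "size_ok": len(raw) == expected_size,
        "head_ok": raw[:5] == b"%PDF-",  # every PDF file starts with this marker
    }


def cda_fields(raw, filename):
    """Build the binary-related cda.* values for a verified PDF."""
    return {
        "data_base64": base64.b64encode(raw).decode("ascii"),
        "data_sha256": hashlib.sha256(raw).hexdigest(),
        "data_filename": filename,
        "data_mime": "application/pdf",
        "data_size": len(raw),
    }


# Stand-in bytes instead of a real download.
raw = b"%PDF-1.7\nexample"
checks = verify_pdf(raw, hashlib.sha256(raw).hexdigest(), len(raw))
if all(checks.values()):
    fields = cda_fields(raw, "example.pdf")
    print(sorted(fields))
else:
    print("skipped:", checks)
```

A record is written only when all three checks pass; a single failed check means the record is skipped rather than stored with a warning.
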
## Usage
```
.venv\Scripts\python.exe Feasibility\store_cda_seaweed_v1.0.py           # dry-run (verifies download + SHA, writes nothing)
.venv\Scripts\python.exe Feasibility\store_cda_seaweed_v1.0.py --apply   # writes to Mongo
```

## Run of 17JUN2026 (--apply)
Stored 5/5 (all SHA256 OK), KROK 4 → 5:
Závada Filip, Bruncák Michal (FNsP B. Bystrica), Machytka Evžen (Asclepiades),
Pumprla Jiří (PreventaMed), Zapotocká Júlia (PAV-MED).
GASTROMART/Molnár skipped (already at KROK 6, CDA stored earlier).

## Dependencies
`pymongo`, `bson` (shipped with `pymongo`) + stdlib. SeaweedFS volume server `192.168.1.50:8888`.
Mongo `192.168.1.76:27017`.
@@ -0,0 +1,126 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name:     store_cda_seaweed_v1.0.py
# Version:  1.0
# Date:     2026-06-17
# Purpose:  Stores signed CDAs (PDF) from the assistants' e-mails in Mongo
#           feasibility.investigators, in the cda.* field, and advances the
#           physician to KROK "5 - CDA podepsano". PDFs are downloaded directly
#           from SeaweedFS (seaweed_url from attachments in
#           emaily."vbuzalka@its.jnj.com"); SHA256 is verified against the
#           metadata from Mongo.
# Usage:    python store_cda_seaweed_v1.0.py            (dry-run / preview)
#           python store_cda_seaweed_v1.0.py --apply    (writes to Mongo)
# Note:     MAPPING below = explicit pairing investigator -> CDA attachment.
#           Stdlib + pymongo only. SeaweedFS host 192.168.1.50:8888.
# =============================================================================

import sys
import base64
import hashlib
import urllib.request
from datetime import datetime, timezone
from pymongo import MongoClient
from bson import ObjectId

MONGO_URI = "mongodb://192.168.1.76:27017"
DBN, COL = "feasibility", "investigators"

# (investigator _id, seaweed_url, filename, sha256, size, source_msg_id, label)
MAPPING = [
|
||||
("6a198b661218c31ab0f5ba57",
|
||||
"http://192.168.1.50:8888/mail-attachments/1a/86/1a86e987b9d3da57c1d863b47734133f2e2d7eae3f5cfe91112c475eb86d86e9",
|
||||
"CZ_CDA PI_MUDr. Filip Zavada_fully signed_16Jun2026.pdf",
|
||||
"1a86e987b9d3da57c1d863b47734133f2e2d7eae3f5cfe91112c475eb86d86e9",
|
||||
479026, "<CH2PR07MB7190A5538ACDC1D49F8B430780E52@CH2PR07MB7190.namprd07.prod.outlook.com>",
|
||||
"Zavada Filip"),
|
||||
("6a19832b5fc2213518257957",
|
||||
"http://192.168.1.50:8888/mail-attachments/64/b0/64b06d48bfe3c49095e326988f14c04fd5849728b227647f6653b2e3c3095538",
|
||||
"SK_CDA PI_Bruncak_FNsP BBystrica_fully signed 16Jun2026.pdf",
|
||||
"64b06d48bfe3c49095e326988f14c04fd5849728b227647f6653b2e3c3095538",
|
||||
498069, "<SA1PR07MB952874B8654156369CDE44448CE52@SA1PR07MB9528.namprd07.prod.outlook.com>",
|
||||
"Bruncak Michal"),
|
||||
("6a19832b5fc2213518257961",
|
||||
"http://192.168.1.50:8888/mail-attachments/c2/72/c272ca62bd27ca10aed35cb54054d880f4f0e2f59940ed3b067b17d51a9ac041",
|
||||
"CZ_CDA Institution_Asclepiades s.r.o._MUDr. Machytka_16Jun2026.pdf",
|
||||
"c272ca62bd27ca10aed35cb54054d880f4f0e2f59940ed3b067b17d51a9ac041",
|
||||
460977, "<PH0PR07MB97879A9C9BF9C00D38D4798A9FE52@PH0PR07MB9787.namprd07.prod.outlook.com>",
|
||||
"Machytka Evzen (Asclepiades)"),
|
||||
("6a19832b5fc2213518257967",
|
||||
"http://192.168.1.50:8888/mail-attachments/99/37/99372c399be3b001428ef4b36d43e250dedced5955de5d1f3a2d63a9f0c1728b",
|
||||
"CZ_CDA institution_PreventaMed sro_fully signed_16Jun2026.pdf",
|
||||
"99372c399be3b001428ef4b36d43e250dedced5955de5d1f3a2d63a9f0c1728b",
|
||||
457745, "<CH2PR07MB719008DB0B3CAFD764AE2E8280E52@CH2PR07MB7190.namprd07.prod.outlook.com>",
|
||||
"Pumprla Jiri (PreventaMed)"),
|
||||
("6a1c4275aa46d8b608065ce9",
|
||||
"http://192.168.1.50:8888/mail-attachments/94/95/9495c742407873efd8dd9713e1dc962cb08e55e0d3690e4a79a90132ee358dee",
|
||||
"SK_CDA Institution_PAV-MED s r.o_fully signed_15Jun2026.pdf",
|
||||
"9495c742407873efd8dd9713e1dc962cb08e55e0d3690e4a79a90132ee358dee",
|
||||
460246, "<CH2PR07MB719008DB0B3CAFD764AE2E8280E52@CH2PR07MB7190.namprd07.prod.outlook.com>",
|
||||
"Zapotocka Julia (PAV-MED)"),
|
||||
]
|
||||
|
||||
|
||||
def fetch(url):
    """Download the whole response body (30 s timeout)."""
    with urllib.request.urlopen(url, timeout=30) as r:
        return r.read()


def main():
    apply = "--apply" in sys.argv  # without --apply the script only verifies the downloads
    cli = MongoClient(MONGO_URI)
    col = cli[DBN][COL]
    now = datetime.now(timezone.utc).isoformat()

    ok = 0
    for _id, url, fname, sha, size, src, label in MAPPING:
        # Plain pymongo does not convert a hex string to ObjectId on its own.
        oid = ObjectId(_id)
        doc = col.find_one({"_id": oid}, {"STATUS": 1, "KROK": 1, "cda.stav": 1})
        if not doc:
            print(f" !! {label}: investigator _id={_id} NENALEZEN")
            continue
        try:
            raw = fetch(url)
        except Exception as e:
            print(f" !! {label}: stazeni selhalo: {e}")
            continue
        # Integrity checks: content hash, byte size and PDF magic header.
        got = hashlib.sha256(raw).hexdigest()
        sha_ok = (got == sha)
        size_ok = (len(raw) == size)
        head_ok = raw[:5] == b"%PDF-"
        print(f" [{label}]")
        print(f" soubor : {fname}")
        print(f" stazeno : {len(raw)} B (ocek. {size}) {'OK' if size_ok else 'MISMATCH'}")
        print(f" sha256 : {'OK' if sha_ok else 'MISMATCH! ' + got}")
        print(f" PDF hdr : {'OK' if head_ok else 'NENI PDF'}")
        print(f" KROK : {doc.get('KROK')} -> 5 - CDA podepsano")
        if not (sha_ok and size_ok and head_ok):
            print(" >> PRESKAKUJI (kontrola selhala)")
            continue
        if not apply:
            ok += 1
            continue

        b64 = base64.b64encode(raw).decode("ascii")
        old_status = doc.get("STATUS", "") or ""
        # The new line is prepended so that the latest entry stays at the top of STATUS.
        new_line = (f"17JUN2026: podepsane CDA ULOZENO do Mongo (cda.data) — {fname} "
                    f"(z e-mailu asistentky). KROK 5, pripraveno na SIPIQ.")
        col.update_one({"_id": oid}, {"$set": {
            "KROK": "5 - CDA podepsano",
            "STATUS": new_line + "\n" + old_status,
            "cda.stav": "podepsano",
            "cda.soubor": fname,
            "cda.zdroj": "e-mail asistentky (SeaweedFS)",
            "cda.data_base64": b64,
            "cda.data_sha256": sha,
            "cda.data_filename": fname,
            "cda.data_mime": "application/pdf",
            "cda.data_size": len(raw),
            "cda.data_stored_at": now,
            "cda.data_source_msg": src,
        }})
        ok += 1
        print(" >> ULOZENO + KROK 5")

    print(f"\n{'ZAPSANO' if apply else 'DRY-RUN OK'}: {ok}/{len(MAPPING)}")
    if not apply:
        print(">>> Pro zapis spust s --apply")


if __name__ == "__main__":
    main()