1376 lines
57 KiB
Python
1376 lines
57 KiB
Python
"""
|
|
jnj_tower_ingest v1.5
|
|
Nazev: jnj_tower_ingest_v1.5.py
|
|
Verze: 1.5.0
|
|
Datum: 2026-06-17
|
|
Autor: vladimir.buzalka
|
|
|
|
ZMENA 1.5:
|
|
PARSE cte navic MAPI PrimarySendAccount (0x0E28) + SentRepresenting email
|
|
(0x0065) PRIMO z OLE (extract_msg je u stringovych props nevraci) a uklada
|
|
je do pole send_account. Kdyz send_account obsahuje "buzalka.cz", nastavi
|
|
send_failed=True + send_error="SendAs buzalka.cz (PrimarySendAccount)".
|
|
Toto je SPOLEHLIVY priznak neodeslane (SendAs-denied) zpravy — na rozdil od
|
|
detekce v tele (1.4), ktera u Sent kopie casto chybi. Rychly dotaz na
|
|
neodeslane: { send_failed: true }. (Historicke doplneno backfill_send_failed.)
|
|
|
|
ZMENA 1.4:
|
|
(a) PARSE detekuje NEODESLANY e-mail: kdyz telo obsahuje stopy chyby
|
|
odeslani (SendAsDenied / "could not be sent" / "TransportSend operation
|
|
has failed" / MapiExceptionSendAsDenied), dokument dostane
|
|
send_failed=true + send_error (vc. kodu ec=). Dotaz na neodeslane:
|
|
{send_failed: true}. (Telo s chybou doteche az re-uploadem z
|
|
jnj_mailbox_sync v1.3 + overwrite na app.py v2.4.)
|
|
(b) Nova FAZE RECONCILE (--reconcile): smaze PROVIZORNI duplikat. Sent
|
|
polozka bez Message-ID (_id zacina filename:/entryid:) je jen prechodny
|
|
snimek; kdyz k ni existuje "dvojce" s REALNYM Message-ID (stejni 'to'
|
|
prijemci + stejny normalized_subject + received_at do 24h), je provizorni
|
|
kopie redundantni -> smaze se. Neodeslane (bez dvojcete) ZUSTANOU. Bezi
|
|
jen s --reconcile; --dry-run = jen plan (nic nemaze). Konzervativni match
|
|
na stabilnim obsahu, ne na EntryID.
|
|
|
|
ZMENA 1.3:
|
|
PARSE faze pri extrakci priloh zaroven nahraje binarku do SeaweedFS
|
|
(Tower1) pres sdileny seaweed_store.py; kazda priloha dostane sha256 +
|
|
seaweed_path + seaweed_url. Dokument s prilohami dostane doc-level
|
|
seaweed_synced_at. Vypadek SeaweedFS parse neshodi (jen warning).
|
|
Backfill jiz naparsovanych: seaweed_attachments_backfill_jnj.py.
|
|
|
|
Popis:
|
|
Sjednoceny Tower-side ingest JNJ e-mailu. Spojuje tri drive oddelene
|
|
casti do jednoho behu (vse bezi v kontejneru python-runner u Monga):
|
|
|
|
FAZE 1 — PARSE (drive parse_emails_tower_v1.3.py):
|
|
.msg soubory z /mnt/JNJEMAILS -> dokument v Mongo
|
|
emaily."vbuzalka@its.jnj.com" (bohata extrakce: telo, prilohy,
|
|
hlavicky, MAPI props, ...). _id = Internet Message-ID.
|
|
INKREMENTALNE: parsuje jen soubory novejsi nez mtime watermark
|
|
(jnj_sync_state/_id="parse_state"). Prvni beh = seed dle filename
|
|
v Mongu. --full reparsuje vse.
|
|
|
|
FAZE 2 — SYNC (drive sync_jnj_state_v1.0.py):
|
|
nejnovejsi /mnt/JNJEMAILS/db/jnjemails_*.db (SQLite, JEN CTENI ro)
|
|
-> zrcadlo do Mongo kolekce 'jnj_messages' (upsert)
|
|
-> doplneni cesty/stavu do emaily."vbuzalka@its.jnj.com":
|
|
jnj_folder = COALESCE(jnj_folder, folder)
|
|
jnj_is_read, jnj_not_in_mailbox, jnj_left_mailbox_at,
|
|
jnj_folder_synced_at (match _id==message_id, fallback
|
|
filename; BEZ upsertu — nezakladame stuby).
|
|
Inkrementalne pres watermark updated_at (jnj_sync_state/_id=
|
|
"watermark") + zkratka last_db (stejna DB -> hned no-op).
|
|
|
|
FAZE 3 — ENRICH (drive jnj_emails_to_fulltext_v1.0.py):
|
|
doindexuje JNJ schranku do PG fulltextu zavolanim SDILENEHO
|
|
skriptu 5_enrich_fulltext_emails_vX.Y.py --mailbox
|
|
"vbuzalka@its.jnj.com" (stejny extractor jako Graph pipeline ->
|
|
konzistentni schema). Verze enrich se auto-detekuje (nejnovejsi
|
|
/scripts/5_enrich_fulltext_emails_v*.py). Spousti se JEN kdyz
|
|
parse pridal nove dokumenty (jinak preskok — JNJ stejne enrichuje
|
|
pipeline v 6:00/18:00). --no-enrich vypne, --enrich-always vynuti.
|
|
|
|
PORADI: parse -> sync -> enrich. Cerstve naparsovane maily dostanou cestu
|
|
(sync) i fulltext (enrich) hned ve stejnem behu (drive: pokud sync/enrich
|
|
predbehl parse, novy mail nemel co zpracovat). Tri nezavisle udalosti
|
|
(nova .msg / nova .db / nove doc pro PG) -> skript udela jen to, co ma
|
|
praci; jinak levny no-op (vhodne pro cron kazdych 5 minut).
|
|
|
|
Spojovaci klic vsude = Internet Message-ID = Mongo _id.
|
|
|
|
Prostredi:
|
|
Docker container "python-runner" na Unraid Tower.
|
|
/mnt/user/JNJEMAILS -> /mnt/JNJEMAILS (.msg v rootu, .db v db/)
|
|
MongoDB 192.168.1.76:27017 (externi).
|
|
|
|
Argumenty:
|
|
--dry-run nic nezapise, jen spocita a vypise plan vsech fazi
|
|
--full parse: reparsuj vse; sync: ignoruj watermark
|
|
--limit N max N souboru (parse) / radku (sync) — test
|
|
--reindex vynut vytvoreni indexu na konci parse faze
|
|
--force sync: ignoruj zkratku last_db (zpracuj i hotovou DB)
|
|
--parse-only spust jen fazi PARSE
|
|
--sync-only spust jen fazi SYNC
|
|
--enrich-only spust jen fazi ENRICH (vynuti enrich i bez novych dat)
|
|
--no-enrich preskoc fazi ENRICH
|
|
--enrich-always spust enrich i kdyz parse nepridal nove dokumenty
|
|
|
|
Spousteni (v kontejneru python-runner):
|
|
# Test:
|
|
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.1.py --dry-run
|
|
# Ostry inkrementalni beh (cron):
|
|
docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.1.py
|
|
# Plny reparse + reindex:
|
|
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.1.py --full --reindex
|
|
|
|
Zavislosti (v image python-runner):
|
|
extract-msg==0.55.0, olefile, pymongo, python-dateutil, sqlite3 (stdlib).
|
|
Enrich faze deleguje na 5_enrich_fulltext_emails (psycopg, bs4 v image).
|
|
Python 3.10+.
|
|
|
|
Historie verzi:
|
|
1.0.0 2026-06-10 Sjednoceni parse_emails_tower_v1.3 + sync_jnj_state_v1.0
|
|
do jedineho skriptu. Parse zinkrementalnen pres mtime
|
|
watermark (drive scan celeho adresare kazdy beh).
|
|
Indexy jen pri full/seed/--reindex. Poradi parse->sync.
|
|
1.1.0 2026-06-10 + FAZE 3 ENRICH: deleguje na sdileny
|
|
5_enrich_fulltext_emails --mailbox (auto-detekce verze),
|
|
jen kdyz parse pridal nove dokumenty. Nahrazuje
|
|
jnj_emails_to_fulltext_v1.0.py (ten -> Trash).
|
|
Flagy --enrich-only/--no-enrich/--enrich-always.
|
|
1.2.0 2026-06-10 SYNC NULL-safe: stary inbox_full_sync zapisuje radky s
|
|
updated_at=NULL; watermark filtr "updated_at > wm" je
|
|
tise zahazoval (NULL > x = false) -> maily mely telo ale
|
|
nikdy nedostaly jnj_folder. Nyni se beru i radky s
|
|
updated_at IS NULL, ktere jeste nejsou v jnj_messages
|
|
(zpracuji se prave jednou). Nic uz se tise nezahodi.
|
|
1.3.0 2026-06-13 PARSE: prilohy do SeaweedFS (sha256 + seaweed_path/url).
|
|
1.4.0 2026-06-16 (a) PARSE detekuje neodeslany e-mail -> send_failed +
|
|
send_error (SendAsDenied marker v tele). (b) Nova faze
|
|
RECONCILE (--reconcile): smaze provizorni no-ID Sent
|
|
kopie, ke kterym existuje dvojce s realnym Message-ID
|
|
(match to+subjekt+cas, ne EntryID); neodeslane ponecha.
|
|
"""
|
|
|
|
import sys
|
|
import os
|
|
import re
|
|
import glob
|
|
import hashlib
|
|
import logging
|
|
import argparse
|
|
import base64
|
|
import struct
|
|
import sqlite3
|
|
import subprocess
|
|
from pathlib import Path
|
|
from datetime import datetime, timezone
|
|
from typing import Optional
|
|
|
|
import extract_msg
|
|
from extract_msg.enums import ErrorBehavior
|
|
import olefile
|
|
from dateutil import parser as dtparser
|
|
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
|
|
|
|
sys.path.insert(0, str(Path(__file__).resolve().parent))
|
|
import seaweed_store as sw
|
|
|
|
if hasattr(sys.stdout, "reconfigure"):
|
|
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
|
|
|
|
# ─── KONFIGURACE ──────────────────────────────────────────────────────────────
|
|
MSGS_DIR = Path("/mnt/JNJEMAILS")
|
|
DB_DIR = "/mnt/JNJEMAILS/db"
|
|
MONGO_URI = "mongodb://192.168.1.76:27017"
|
|
MONGO_DB = "emaily"
|
|
EMAILS_COL = "vbuzalka@its.jnj.com"
|
|
MIRROR_COL = "jnj_messages"
|
|
STATE_COL = "jnj_sync_state"
|
|
BATCH_SIZE = 200
|
|
LOG_FILE = Path(__file__).parent / "jnj_tower_ingest_errors.log"
|
|
ENRICH_GLOB = "/scripts/5_enrich_fulltext_emails_v*.py" # sdileny PG enrich
|
|
SCRIPT_VERSION = "1.5.0"
|
|
|
|
# Stopy chyby odeslani v tele (.msg neodeslaneho e-mailu) — viz hustak/SendAsDenied
|
|
SEND_FAIL_MARKERS = (
|
|
"MapiExceptionSendAsDenied", "SendAsDeniedException",
|
|
"could not be sent", "TransportSend operation has failed",
|
|
"Transport-Send failed",
|
|
)
|
|
|
|
# Sloupce zrcadlene ze SQLite messages -> jnj_messages
|
|
ROW_COLS = ["message_id", "subject", "sender", "received_at", "folder",
|
|
"jnj_folder", "is_read", "not_in_mailbox_anymore", "left_mailbox_at",
|
|
"entry_id", "graph_id", "updated_at", "source"]
|
|
# ──────────────────────────────────────────────────────────────────────────────
|
|
|
|
logging.basicConfig(
|
|
filename=str(LOG_FILE),
|
|
level=logging.ERROR,
|
|
format="%(asctime)s | %(message)s",
|
|
datefmt="%Y-%m-%d %H:%M:%S",
|
|
encoding="utf-8",
|
|
)
|
|
|
|
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
# FAZE 1 — PARSE (.msg -> Mongo emaily) [drive parse_emails_tower_v1.3.py]
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
|
|
def safe(obj, *attrs, default=None):
|
|
"""Bezpecne cteni atributu — vrati prvni non-None hodnotu."""
|
|
for attr in attrs:
|
|
try:
|
|
val = getattr(obj, attr, None)
|
|
if val is None:
|
|
continue
|
|
if isinstance(val, str) and not val.strip():
|
|
continue
|
|
return val
|
|
except Exception:
|
|
continue
|
|
return default
|
|
|
|
|
|
def parse_date(raw) -> Optional[datetime]:
|
|
"""Libovolny datum -> UTC datetime bez tzinfo (pro MongoDB)."""
|
|
if raw is None:
|
|
return None
|
|
if isinstance(raw, datetime):
|
|
if raw.tzinfo:
|
|
return raw.astimezone(timezone.utc).replace(tzinfo=None)
|
|
return raw
|
|
try:
|
|
dt = dtparser.parse(str(raw))
|
|
if dt.tzinfo:
|
|
return dt.astimezone(timezone.utc).replace(tzinfo=None)
|
|
return dt
|
|
except Exception:
|
|
return None
|
|
|
|
|
|
_INT64_MIN, _INT64_MAX = -(2 ** 63), 2 ** 63 - 1
|
|
|
|
|
|
def to_bson(val):
|
|
"""Konvertuje hodnotu na BSON-serializovatelny typ.
|
|
|
|
Pozor: BSON umi jen signed int64. Python ma neomezene integery, takze
|
|
velke MAPI hodnoty (PR_CHANGE_KEY, FILETIME, 64-bit handle) mimo rozsah
|
|
int64 prevadime na string — jinak cely bulk_write spadne na
|
|
'MongoDB can only handle up to 8-byte ints'.
|
|
"""
|
|
# bool musi byt PRED int (isinstance(True, int) == True)
|
|
if isinstance(val, bool):
|
|
return val
|
|
if isinstance(val, bytes):
|
|
return val.hex() if len(val) <= 128 else f"<bytes:{len(val)}>"
|
|
if isinstance(val, datetime):
|
|
return parse_date(val)
|
|
if isinstance(val, int):
|
|
return val if _INT64_MIN <= val <= _INT64_MAX else str(val)
|
|
if isinstance(val, (str, float, type(None))):
|
|
return val
|
|
if isinstance(val, list):
|
|
return [to_bson(v) for v in val]
|
|
try:
|
|
iv = int(val)
|
|
return iv if _INT64_MIN <= iv <= _INT64_MAX else str(iv)
|
|
except Exception:
|
|
pass
|
|
return str(val)
|
|
|
|
|
|
def extract_headers(msg) -> dict:
|
|
headers = {}
|
|
try:
|
|
hdr = msg.header
|
|
if not hdr:
|
|
return {}
|
|
from email.header import decode_header as _dh
|
|
|
|
def _decode(v: str) -> str:
|
|
try:
|
|
parts = _dh(v)
|
|
out = ""
|
|
for part, enc in parts:
|
|
out += part.decode(enc or "utf-8", errors="replace") if isinstance(part, bytes) else part
|
|
return out
|
|
except Exception:
|
|
return v
|
|
|
|
for key in set(hdr.keys()):
|
|
k = key.lower().replace("-", "_")
|
|
vals = [_decode(v) for v in hdr.get_all(key, [])]
|
|
headers[k] = vals if len(vals) > 1 else (vals[0] if vals else "")
|
|
except Exception as e:
|
|
logging.error("extract_headers: %s", e)
|
|
return headers
|
|
|
|
|
|
def extract_recipients(msg) -> list:
|
|
result = []
|
|
type_map = {1: "to", 2: "cc", 3: "bcc"}
|
|
try:
|
|
for r in msg.recipients:
|
|
rtype = getattr(r, "type", 1)
|
|
try:
|
|
rtype = int(rtype)
|
|
except Exception:
|
|
try:
|
|
rtype = int(rtype.value)
|
|
except Exception:
|
|
rtype = 1
|
|
rec = {
|
|
"type": type_map.get(rtype, "to"),
|
|
"email": safe(r, "email", default=""),
|
|
"name": safe(r, "name", default=""),
|
|
}
|
|
result.append(rec)
|
|
except Exception as e:
|
|
logging.error("extract_recipients: %s", e)
|
|
return result
|
|
|
|
|
|
def extract_attachments(msg) -> list:
|
|
result = []
|
|
try:
|
|
for att in msg.attachments:
|
|
fname = safe(att, "longFilename", "shortFilename", default="")
|
|
if not fname:
|
|
continue
|
|
size = 0
|
|
raw = None
|
|
try:
|
|
d = att.data
|
|
if isinstance(d, (bytes, bytearray)):
|
|
raw = bytes(d)
|
|
size = len(raw)
|
|
elif d:
|
|
size = len(d) # embedded message apod. — bez bajtu
|
|
except Exception:
|
|
pass
|
|
mime = safe(att, "mimetype", "mimeType", default="application/octet-stream")
|
|
entry = {
|
|
"filename": fname,
|
|
"size_bytes": size,
|
|
"mime_type": mime,
|
|
"content_id": safe(att, "cid", default=None),
|
|
"is_inline": bool(safe(att, "isInline", default=False)),
|
|
}
|
|
# SeaweedFS upload (dedup dle obsahu, sdilene s Graph/mailstore vetvi).
|
|
# Vypadek SeaweedFS NESMI shodit parse — pole se proste nedoplni a
|
|
# dozene je seaweed_attachments_backfill_jnj.py.
|
|
if raw:
|
|
try:
|
|
h = hashlib.sha256(raw).hexdigest()
|
|
path, url, _ = sw.store(h, raw, mime)
|
|
entry["sha256"] = h
|
|
entry["seaweed_path"] = path
|
|
entry["seaweed_url"] = url
|
|
except Exception as e:
|
|
logging.warning("SeaweedFS upload selhal (%s): %s", fname, e)
|
|
result.append(entry)
|
|
except Exception as e:
|
|
logging.error("extract_attachments: %s", e)
|
|
return result
|
|
|
|
|
|
def extract_mapi_props(msg) -> dict:
|
|
"""Vsechny raw MAPI properties jako {0xXXXX: value}."""
|
|
result = {}
|
|
try:
|
|
props = msg.props
|
|
if not hasattr(props, "items"):
|
|
return {}
|
|
for key, prop in props.items():
|
|
try:
|
|
val = to_bson(prop.value)
|
|
prop_id = f"0x{key[:4].upper()}" if len(key) >= 4 else f"0x{key.upper()}"
|
|
result[prop_id] = val
|
|
except Exception:
|
|
pass
|
|
except Exception as e:
|
|
logging.error("extract_mapi_props: %s", e)
|
|
return result
|
|
|
|
|
|
# ─── Tolerantni otevirani a raw-OLE fallback ─────────────────────────────────
|
|
_CPID_TO_CODEC = {
|
|
1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
|
|
1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257",
|
|
1258: "cp1258", 874: "cp874", 932: "shift_jis", 936: "gb2312",
|
|
949: "euc_kr", 950: "big5", 65001: "utf-8", 28591: "iso-8859-1",
|
|
28592: "iso-8859-2", 20127: "ascii",
|
|
}
|
|
|
|
|
|
def _read_u32_prop(ole, propid):
|
|
"""Precte 32-bit hodnotu MAPI property z top-level __properties_version1.0."""
|
|
try:
|
|
data = ole.openstream("__properties_version1.0").read()
|
|
except Exception:
|
|
return None
|
|
body = data[32:] # 32-bajtova hlavicka top-level property streamu
|
|
for i in range(0, len(body) - 16 + 1, 16):
|
|
rec = body[i:i + 16]
|
|
tag = struct.unpack("<I", rec[0:4])[0]
|
|
if ((tag >> 16) & 0xFFFF) == propid:
|
|
return struct.unpack("<I", rec[8:12])[0]
|
|
return None
|
|
|
|
|
|
def _detect_cpid(ole) -> Optional[str]:
|
|
"""Codec dle PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE (jako napoveda, ne dogma)."""
|
|
for pid in (0x3FDE, 0x3FFD): # INTERNET_CPID, MESSAGE_CODEPAGE
|
|
codec = _CPID_TO_CODEC.get(_read_u32_prop(ole, pid))
|
|
# utf-8/ascii nejsou dobry hint pro 8-bit stream (casto lzou)
|
|
if codec and codec not in ("utf-8", "ascii"):
|
|
return codec
|
|
return None
|
|
|
|
|
|
def _cascade_decode(raw: bytes, is_unicode: bool, cpid_codec: Optional[str]) -> str:
|
|
"""Dekoduje bajty MAPI stringu. Hlavickam se neveri — zkousime striktne
|
|
v poradi priorit a vezmeme prvni, co projde bez chyby."""
|
|
if not raw:
|
|
return ""
|
|
if is_unicode: # PT_UNICODE = utf-16-le
|
|
try:
|
|
return raw.decode("utf-16-le")
|
|
except Exception:
|
|
return raw.decode("utf-16-le", errors="replace")
|
|
order = ["utf-8"] # utf-8 strict = silny rozlisovac
|
|
if cpid_codec:
|
|
order.append(cpid_codec)
|
|
order += ["cp1250", "cp1252", "gb2312", "big5"]
|
|
for enc in order:
|
|
try:
|
|
return raw.decode(enc, errors="strict")
|
|
except Exception:
|
|
continue
|
|
return raw.decode("latin-1", errors="replace") # nikdy nespadne
|
|
|
|
|
|
def _raw_mapi_strings(msg_path: Path) -> dict:
|
|
"""Cte klicova textova MAPI pole PRIMO z OLE (mimo extract_msg).
|
|
Pouzije se jen kdyz extract_msg vrati degradovane pole."""
|
|
out = {"subject": "", "normalized_subject": "", "sender_name": "",
|
|
"sender_email": "", "sender_smtp": "", "body_text": "", "body_html": ""}
|
|
try:
|
|
ole = olefile.OleFileIO(str(msg_path))
|
|
except Exception:
|
|
return out
|
|
try:
|
|
cpid = _detect_cpid(ole)
|
|
wanted = { # MAPI tag -> klic v out
|
|
"0037": "subject", "0E1D": "normalized_subject",
|
|
"0C1A": "sender_name", "5D01": "sender_smtp",
|
|
"0C1F": "sender_email", "1000": "body_text", "1013": "body_html",
|
|
}
|
|
prefix = "__substg1.0_"
|
|
found = {} # key -> (priorita_typu, hodnota)
|
|
for entry in ole.listdir():
|
|
if len(entry) != 1: # jen top-level (ne vnorene zpravy)
|
|
continue
|
|
name = entry[0]
|
|
if not name.startswith(prefix):
|
|
continue
|
|
tag = name[len(prefix):len(prefix) + 4].upper()
|
|
key = wanted.get(tag)
|
|
if not key:
|
|
continue
|
|
typ = name[-4:].upper()
|
|
prio = {"001F": 3, "001E": 2, "0102": 1}.get(typ, 0)
|
|
if prio == 0:
|
|
continue
|
|
prev = found.get(key)
|
|
if prev and prev[0] >= prio: # preferuj unicode > ansi > binarni
|
|
continue
|
|
try:
|
|
raw = ole.openstream(entry).read()
|
|
val = _cascade_decode(raw, typ == "001F", cpid)
|
|
except Exception:
|
|
continue
|
|
found[key] = (prio, val)
|
|
for key, (_, val) in found.items():
|
|
out[key] = val
|
|
finally:
|
|
ole.close()
|
|
return out
|
|
|
|
|
|
def _degraded(s) -> bool:
|
|
"""Pole je degradovane: prazdne nebo obsahuje U+FFFD (nahradni znak)."""
|
|
return (not s) or ("�" in s)
|
|
|
|
|
|
def open_message(msg_path: Path):
|
|
"""Kaskadove otevreni .msg -> (msg, mode) nebo (None, None)."""
|
|
try:
|
|
return extract_msg.Message(str(msg_path)), "normal"
|
|
except Exception:
|
|
pass
|
|
try:
|
|
return extract_msg.Message(
|
|
str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL), "suppress_all"
|
|
except Exception:
|
|
pass
|
|
encs = []
|
|
try:
|
|
ole = olefile.OleFileIO(str(msg_path))
|
|
c = _detect_cpid(ole)
|
|
ole.close()
|
|
if c:
|
|
encs.append(c)
|
|
except Exception:
|
|
pass
|
|
for e in encs + ["cp1250", "cp1252"]:
|
|
try:
|
|
return extract_msg.Message(
|
|
str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL,
|
|
overrideEncoding=e), f"override:{e}"
|
|
except Exception:
|
|
continue
|
|
return None, None
|
|
|
|
|
|
def detect_send_failure(*texts):
|
|
"""Vrati (send_failed, send_error) — hleda stopy chyby odeslani v tele.
|
|
Stopy se objevi v neodeslanem .msg (napr. SendAsDenied) az kdyz Outlook
|
|
chybu dopsal a re-upload (jnj_mailbox_sync v1.3) ji prinesl na Tower."""
|
|
blob = "\n".join(t for t in texts if isinstance(t, str))
|
|
if not blob:
|
|
return False, None
|
|
if not any(m in blob for m in SEND_FAIL_MARKERS):
|
|
return False, None
|
|
err = "send failed"
|
|
m = re.search(r"ec=(\d+)", blob)
|
|
if m:
|
|
err = f"SendAsDenied (ec={m.group(1)})"
|
|
m2 = re.search(r"Error is \[([0-9xA-Fa-f\-]+)\]", blob)
|
|
if m2:
|
|
err += f" {m2.group(1)}"
|
|
return True, err
|
|
|
|
|
|
def read_send_account(msg_path: Path) -> str:
|
|
"""Precte PrimarySendAccount (0x0E28) + SentRepresenting email (0x0065)
|
|
PRIMO z OLE — extract_msg tyto stringove props nevraci. Spojene do jednoho
|
|
retezce; pro detekci SendAs (buzalka.cz na uctu its.jnj.com = odmitnuto)."""
|
|
try:
|
|
ole = olefile.OleFileIO(str(msg_path))
|
|
except Exception:
|
|
return ""
|
|
out = []
|
|
try:
|
|
for tag4 in ("0E28", "0065"):
|
|
for t in (tag4 + "001F", tag4 + "001E"):
|
|
name = "__substg1.0_" + t
|
|
if ole.exists(name):
|
|
try:
|
|
raw = ole.openstream(name).read()
|
|
s = (raw.decode("utf-16-le") if t.endswith("001F")
|
|
else raw.decode("cp1250", errors="replace"))
|
|
except Exception:
|
|
s = raw.decode("latin-1", errors="replace")
|
|
s = s.strip()
|
|
if s:
|
|
out.append(s)
|
|
break
|
|
finally:
|
|
ole.close()
|
|
return " ".join(out)
|
|
|
|
|
|
def extract_message(msg_path: Path) -> Optional[dict]:
|
|
"""Parsuje jeden .msg soubor -> MongoDB dokument."""
|
|
msg, parse_mode = open_message(msg_path)
|
|
if msg is None:
|
|
logging.error("open failed [%s]: vsechny pokusy o otevreni selhaly", msg_path.name)
|
|
return None
|
|
|
|
try:
|
|
# ── Message-ID ────────────────────────────────────────────────
|
|
mid = None
|
|
for attr in ("messageId", "message_id", "internetMessageId"):
|
|
mid = safe(msg, attr)
|
|
if mid:
|
|
break
|
|
if not mid:
|
|
mid = f"filename:{msg_path.stem}"
|
|
mid = str(mid).strip()
|
|
|
|
# ── Predmet ───────────────────────────────────────────────────
|
|
try:
|
|
subject = msg.subject or ""
|
|
except Exception:
|
|
subject = ""
|
|
|
|
normalized_subject = safe(msg, "normalizedSubject", "normalized_subject", default="")
|
|
|
|
# ── Telo ──────────────────────────────────────────────────────
|
|
try:
|
|
body_text = msg.body or ""
|
|
except Exception:
|
|
body_text = ""
|
|
|
|
body_html = None
|
|
try:
|
|
bh = msg.htmlBody
|
|
if isinstance(bh, bytes):
|
|
bh = bh.decode("utf-8", errors="replace")
|
|
if bh:
|
|
body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
|
|
except Exception:
|
|
pass
|
|
|
|
# ── Odesilatel ────────────────────────────────────────────────
|
|
try:
|
|
sender_email = msg.sender or ""
|
|
except Exception:
|
|
sender_email = ""
|
|
|
|
sender_name = safe(msg, "senderName", "sender_name", default="")
|
|
sender_smtp = safe(msg, "senderSmtpAddress", "sent_representing_smtp_address", default="")
|
|
|
|
# ── Prijemci ──────────────────────────────────────────────────
|
|
recipients = extract_recipients(msg)
|
|
|
|
try:
|
|
to_raw = msg.to or ""
|
|
except Exception:
|
|
to_raw = ""
|
|
try:
|
|
cc_raw = msg.cc or ""
|
|
except Exception:
|
|
cc_raw = ""
|
|
try:
|
|
bcc_raw = getattr(msg, "bcc", None) or ""
|
|
except Exception:
|
|
bcc_raw = ""
|
|
|
|
display_to = safe(msg, "displayTo", "display_to", default="")
|
|
display_cc = safe(msg, "displayCc", "display_cc", default="")
|
|
|
|
# ── Casy ──────────────────────────────────────────────────────
|
|
try:
|
|
received_at = parse_date(msg.date)
|
|
except Exception:
|
|
received_at = None
|
|
|
|
sent_at = None
|
|
for attr in ("clientSubmitTime", "client_submit_time", "sentOn"):
|
|
v = safe(msg, attr)
|
|
if v:
|
|
sent_at = parse_date(v)
|
|
break
|
|
|
|
# ── MAPI vlastnosti ───────────────────────────────────────────
|
|
importance = 1
|
|
try:
|
|
v = msg.importance
|
|
if v is not None:
|
|
importance = int(v)
|
|
except Exception:
|
|
pass
|
|
|
|
sensitivity = 0
|
|
try:
|
|
v = getattr(msg, "sensitivity", None)
|
|
if v is not None:
|
|
sensitivity = int(v)
|
|
except Exception:
|
|
pass
|
|
|
|
flag_status = 0
|
|
try:
|
|
v = safe(msg, "flagStatus", "flag_status")
|
|
if v is not None:
|
|
flag_status = int(v)
|
|
except Exception:
|
|
pass
|
|
|
|
conversation_topic = safe(msg, "conversationTopic", "conversation_topic", default="")
|
|
|
|
conversation_index = ""
|
|
try:
|
|
ci = safe(msg, "conversationIndex", "conversation_index")
|
|
if isinstance(ci, bytes):
|
|
conversation_index = base64.b64encode(ci).decode()
|
|
elif ci:
|
|
conversation_index = str(ci)
|
|
except Exception:
|
|
pass
|
|
|
|
in_reply_to = safe(msg, "inReplyTo", "in_reply_to", default="")
|
|
|
|
internet_refs = []
|
|
try:
|
|
refs = safe(msg, "internetReferences", "internet_references")
|
|
if isinstance(refs, list):
|
|
internet_refs = refs
|
|
elif isinstance(refs, str) and refs:
|
|
internet_refs = [r.strip() for r in refs.split() if r.strip()]
|
|
except Exception:
|
|
pass
|
|
|
|
categories = []
|
|
try:
|
|
cats = safe(msg, "categories")
|
|
if isinstance(cats, list):
|
|
categories = [str(c) for c in cats if c]
|
|
elif isinstance(cats, str) and cats:
|
|
categories = [c.strip() for c in re.split(r"[;,]", cats) if c.strip()]
|
|
except Exception:
|
|
pass
|
|
|
|
read_receipt = bool(safe(msg, "readReceiptRequested", "read_receipt_requested", default=False))
|
|
delivery_receipt = bool(safe(msg, "deliveryReceiptRequested", "delivery_receipt_requested", default=False))
|
|
|
|
# ── Internet headers ──────────────────────────────────────────
|
|
headers = extract_headers(msg)
|
|
|
|
if not in_reply_to:
|
|
in_reply_to = headers.get("in_reply_to", "")
|
|
if not internet_refs:
|
|
refs_str = headers.get("references", "")
|
|
if isinstance(refs_str, str) and refs_str:
|
|
internet_refs = [r.strip() for r in refs_str.split() if r.strip()]
|
|
|
|
# ── Prilohy ───────────────────────────────────────────────────
|
|
attachments = extract_attachments(msg)
|
|
|
|
# ── Raw MAPI ──────────────────────────────────────────────────
|
|
mapi_raw = extract_mapi_props(msg)
|
|
|
|
msg.close()
|
|
|
|
# ── Raw-OLE fallback pro degradovana textova pole ─────────────
|
|
parse_degraded = parse_mode != "normal"
|
|
forced = parse_mode != "normal"
|
|
if (forced or _degraded(subject) or _degraded(body_text)
|
|
or _degraded(sender_email) or (body_html and "�" in body_html)):
|
|
raw = _raw_mapi_strings(msg_path)
|
|
if raw["subject"] and (forced or _degraded(subject)):
|
|
subject = raw["subject"]
|
|
if raw["normalized_subject"] and (forced or _degraded(normalized_subject)):
|
|
normalized_subject = raw["normalized_subject"]
|
|
if raw["body_text"] and (forced or _degraded(body_text)):
|
|
body_text = raw["body_text"]
|
|
if raw["body_html"] and (forced or not body_html or "�" in body_html):
|
|
bh = raw["body_html"]
|
|
body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
|
|
if (raw["sender_smtp"] or raw["sender_email"]) and (forced or _degraded(sender_email)):
|
|
sender_email = raw["sender_smtp"] or raw["sender_email"]
|
|
if raw["sender_name"] and (forced or _degraded(sender_name)):
|
|
sender_name = raw["sender_name"]
|
|
if raw["sender_smtp"] and not sender_smtp:
|
|
sender_smtp = raw["sender_smtp"]
|
|
|
|
# ── Detekce neodeslaneho e-mailu (v1.4 telo + v1.5 send-account) ──
|
|
send_failed, send_error = detect_send_failure(body_text, body_html)
|
|
send_account = read_send_account(msg_path) # v1.5
|
|
if send_account and "buzalka.cz" in send_account.lower():
|
|
send_failed = True
|
|
if not send_error:
|
|
send_error = "SendAs buzalka.cz (PrimarySendAccount)"
|
|
|
|
# ── Dokument ──────────────────────────────────────────────────
|
|
return {
|
|
"_id": mid,
|
|
"filename": msg_path.name,
|
|
|
|
"subject": subject,
|
|
"normalized_subject": normalized_subject,
|
|
"importance": importance,
|
|
"sensitivity": sensitivity,
|
|
"flag_status": flag_status,
|
|
"read_receipt_requested": read_receipt,
|
|
"delivery_receipt_requested": delivery_receipt,
|
|
"has_attachments": len(attachments) > 0,
|
|
"attachment_count": len(attachments),
|
|
"message_size_bytes": msg_path.stat().st_size,
|
|
|
|
"conversation_topic": conversation_topic,
|
|
"conversation_index": conversation_index,
|
|
"in_reply_to": in_reply_to,
|
|
"internet_references": internet_refs,
|
|
"categories": categories,
|
|
|
|
"received_at": received_at,
|
|
"sent_at": sent_at,
|
|
|
|
"sender": {
|
|
"email": sender_email,
|
|
"name": sender_name,
|
|
"smtp": sender_smtp,
|
|
},
|
|
"to": to_raw,
|
|
"cc": cc_raw,
|
|
"bcc": bcc_raw,
|
|
"display_to": display_to,
|
|
"display_cc": display_cc,
|
|
"recipients": recipients,
|
|
|
|
"body_text": body_text,
|
|
"body_html": body_html,
|
|
|
|
"attachments": attachments,
|
|
"headers": headers,
|
|
"mapi": mapi_raw,
|
|
|
|
"parse_mode": parse_mode,
|
|
"parse_degraded": parse_degraded,
|
|
"send_failed": send_failed,
|
|
"send_error": send_error,
|
|
"send_account": send_account,
|
|
|
|
"parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
|
|
# priznak ze prilohy (pokud nejake) jsou v SeaweedFS — pro backfill
|
|
"seaweed_synced_at": (datetime.now(timezone.utc).replace(tzinfo=None)
|
|
if any(a.get("seaweed_path") for a in attachments)
|
|
else None),
|
|
}
|
|
|
|
except Exception as e:
|
|
logging.error("extract_message failed [%s]: %s", msg_path.name, e)
|
|
return None
|
|
|
|
|
|
def create_indexes(col):
|
|
print(" Vytvarim indexy...")
|
|
col.create_index([("received_at", ASCENDING)])
|
|
col.create_index([("sent_at", ASCENDING)])
|
|
col.create_index([("sender.email", ASCENDING)])
|
|
col.create_index([("filename", ASCENDING)], unique=True, sparse=True)
|
|
col.create_index([("conversation_topic", ASCENDING)])
|
|
col.create_index([("has_attachments", ASCENDING)])
|
|
col.create_index([("categories", ASCENDING)])
|
|
col.create_index([("importance", ASCENDING)])
|
|
col.create_index([("flag_status", ASCENDING)])
|
|
col.create_index([
|
|
("subject", TEXT),
|
|
("body_text", TEXT),
|
|
("to", TEXT),
|
|
("cc", TEXT),
|
|
], name="text_search", default_language="none")
|
|
print(" Indexy hotovy.")
|
|
|
|
|
|
def run_parse(col, state_col, args, now) -> dict:
|
|
"""FAZE 1: inkrementalni parse .msg -> emaily. Vraci statistiku."""
|
|
stats = {"mode": None, "total_files": 0, "candidates": 0, "ok": 0, "err": 0}
|
|
print("\n=== FAZE 1: PARSE (.msg -> emaily) ===")
|
|
|
|
all_files = sorted(MSGS_DIR.glob("*.msg"))
|
|
stats["total_files"] = len(all_files)
|
|
if not all_files:
|
|
print(" Zadne .msg ve zdroji -> preskakuji.")
|
|
return stats
|
|
max_mtime = max(f.stat().st_mtime for f in all_files)
|
|
|
|
ps = state_col.find_one({"_id": "parse_state"}) or {}
|
|
last_mtime = ps.get("last_parse_mtime")
|
|
|
|
if args.full:
|
|
candidates = all_files
|
|
mode = "full"
|
|
elif last_mtime is None:
|
|
print(" Prvni beh (zadny mtime watermark) -> seed dle filename v Mongu...")
|
|
existing = set(col.distinct("filename"))
|
|
candidates = [f for f in all_files if f.name not in existing]
|
|
mode = "seed"
|
|
print(f" V Mongu jiz {len(existing)} filename; nove k naparsovani: {len(candidates)}")
|
|
else:
|
|
candidates = [f for f in all_files if f.stat().st_mtime > last_mtime]
|
|
mode = "incremental"
|
|
if args.limit:
|
|
candidates = candidates[:args.limit]
|
|
|
|
stats["mode"] = mode
|
|
stats["candidates"] = len(candidates)
|
|
wm_str = datetime.fromtimestamp(last_mtime).strftime("%Y-%m-%d %H:%M:%S") if last_mtime else "(zadny)"
|
|
print(f" Rezim: {mode} | .msg celkem {len(all_files)} | watermark {wm_str} | ke zpracovani {len(candidates)}")
|
|
|
|
if not candidates:
|
|
print(" Nic noveho k parsovani.")
|
|
# I tak posun watermark na nejnovejsi soubor (krome --full a dry-run)
|
|
if not args.dry_run and mode != "full":
|
|
state_col.update_one({"_id": "parse_state"},
|
|
{"$set": {"last_parse_mtime": max_mtime, "last_parse_at": now}}, upsert=True)
|
|
return stats
|
|
|
|
if args.dry_run:
|
|
print(f" DRY-RUN: naparsoval bych {len(candidates)} souboru (Mongo se nemeni). Ukazka:")
|
|
for f in candidates[:10]:
|
|
mt = datetime.fromtimestamp(f.stat().st_mtime).strftime("%Y-%m-%d %H:%M:%S")
|
|
print(f" + {f.name} (mtime {mt})")
|
|
if len(candidates) > 10:
|
|
print(f" ... a dalsich {len(candidates) - 10}")
|
|
return stats
|
|
|
|
batch = []
|
|
verbose = len(candidates) <= 30
|
|
|
|
def flush():
|
|
if not batch:
|
|
return
|
|
try:
|
|
col.bulk_write(batch, ordered=False)
|
|
except Exception as e:
|
|
logging.error("bulk_write spadl (%s) -- prepinam na per-dokument", e)
|
|
print(f" CHYBA bulk_write: {e} -- zkousim per-dokument")
|
|
for op in batch:
|
|
try:
|
|
col.bulk_write([op], ordered=False)
|
|
except Exception as e2:
|
|
try:
|
|
bad_id = getattr(op, "_filter", {}).get("_id", "?")
|
|
except Exception:
|
|
bad_id = "?"
|
|
logging.error("per-dokument selhal [_id=%s]: %s", bad_id, e2)
|
|
print(f" ZAHOZEN _id={bad_id}: {e2}")
|
|
stats["ok"] -= 1
|
|
stats["err"] += 1
|
|
batch.clear()
|
|
|
|
for i, msg_path in enumerate(candidates, 1):
|
|
doc = extract_message(msg_path)
|
|
if doc is None:
|
|
stats["err"] += 1
|
|
else:
|
|
batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
|
|
stats["ok"] += 1
|
|
if len(batch) >= BATCH_SIZE:
|
|
flush()
|
|
if verbose:
|
|
status = "ERR " if doc is None else "OK "
|
|
subj = (doc.get("subject") or "")[:60] if doc else "?"
|
|
print(f" {i:>5}/{len(candidates)} {status} {subj}")
|
|
elif i % 500 == 0:
|
|
print(f" prubeh {i}/{len(candidates)} ok={stats['ok']} err={stats['err']}")
|
|
flush()
|
|
|
|
# Indexy jen pri full/seed/--reindex (v inkrementalnim behu uz existuji)
|
|
if mode in ("full", "seed") or args.reindex:
|
|
create_indexes(col)
|
|
|
|
# Posun watermark na nejnovejsi soubor
|
|
state_col.update_one({"_id": "parse_state"},
|
|
{"$set": {"last_parse_mtime": max_mtime, "last_parse_at": now,
|
|
"last_parsed_count": stats["ok"], "last_parse_mode": mode}},
|
|
upsert=True)
|
|
print(f" PARSE hotovo: ok={stats['ok']} err={stats['err']} "
|
|
f"watermark={datetime.fromtimestamp(max_mtime):%Y-%m-%d %H:%M:%S}")
|
|
return stats
|
|
|
|
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
# FAZE 2 — SYNC (SQLite -> Mongo jnj_messages + emaily cesta)
|
|
# [drive sync_jnj_state_v1.0.py]
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
|
|
def norm_mid(s: str) -> str:
|
|
return (s or "").strip().strip("<>").strip()
|
|
|
|
|
|
def coalesce_path(jnjf, fld) -> str:
|
|
return jnjf if (jnjf and jnjf.strip()) else (fld or "")
|
|
|
|
|
|
def newest_db():
|
|
cands = glob.glob(os.path.join(DB_DIR, "jnjemails_*.db")) or glob.glob(os.path.join(DB_DIR, "*.db"))
|
|
return max(cands, key=os.path.getmtime) if cands else None
|
|
|
|
|
|
def run_sync(db, args, now) -> dict:
|
|
"""FAZE 2: SQLite -> jnj_messages (zrcadlo) + emaily (cesta/stav)."""
|
|
stats = {"total": 0, "matched": 0, "skipped": False}
|
|
print("\n=== FAZE 2: SYNC (SQLite -> jnj_messages + emaily cesta) ===")
|
|
|
|
emails = db[EMAILS_COL]
|
|
state_col = db[STATE_COL]
|
|
|
|
db_path = newest_db()
|
|
if not db_path:
|
|
print(f" Zadna .db v {DB_DIR} -> preskakuji.")
|
|
stats["skipped"] = True
|
|
return stats
|
|
db_name = os.path.basename(db_path)
|
|
print(f" SQLite: {db_name}")
|
|
|
|
st = state_col.find_one({"_id": "watermark"}) or {}
|
|
|
|
# ── Zkratka: tuto DB uz jsme zpracovali? (jen inkrementalni rezim) ─────
|
|
if not args.full and not args.force and st.get("last_db") == db_name:
|
|
print(f" DB {db_name} uz byla zpracovana (last_db) -> nic na praci.")
|
|
stats["skipped"] = True
|
|
return stats
|
|
|
|
wm = None if args.full else st.get("last_updated_at")
|
|
print(f" Watermark: {wm or '(zadny -> vse)'}")
|
|
|
|
# ── SQLite (read-only) ────────────────────────────────────────────────
|
|
con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
|
|
con.row_factory = sqlite3.Row
|
|
available = {row[1] for row in con.execute("PRAGMA table_info(messages)")}
|
|
sel_cols = [c for c in ROW_COLS if c in available]
|
|
missing = [c for c in ROW_COLS if c not in available]
|
|
if missing:
|
|
print(f" (DB nema sloupce: {', '.join(missing)} -> default None/0)")
|
|
has_updated = "updated_at" in available
|
|
|
|
# ── NULL-safe vyber radku ─────────────────────────────────────────────
|
|
# Stary inbox_full_sync zapisuje radky s updated_at=NULL; cisty watermark
|
|
# filtr "updated_at > wm" je v SQL TISE zahazuje (NULL > x = false).
|
|
# Bereme proto i radky s updated_at IS NULL, ktere jeste NEJSOU v zrcadle
|
|
# jnj_messages (aby se zpracovaly prave jednou). --full bere vse.
|
|
mirrored_ids = set()
|
|
if not args.full:
|
|
mirrored_ids = {d["_id"] for d in db[MIRROR_COL].find({}, {"_id": 1})}
|
|
|
|
q = f"SELECT {', '.join(sel_cols)} FROM messages"
|
|
params = ()
|
|
if not args.full and wm and has_updated:
|
|
q += " WHERE updated_at > ? OR updated_at IS NULL"
|
|
params = (wm,)
|
|
elif not args.full and wm and not has_updated:
|
|
print(" (DB nema updated_at -> watermark ignorovan, beru vse)")
|
|
wm = None
|
|
raw_rows = con.execute(q, params).fetchall()
|
|
con.close()
|
|
|
|
rows = []
|
|
skipped_null = 0
|
|
for row in raw_rows:
|
|
d = dict(row)
|
|
if (not args.full) and d.get("updated_at") is None and d.get("message_id") in mirrored_ids:
|
|
skipped_null += 1 # NULL radek uz zrcadleny -> hotovo, nepocitame znovu
|
|
continue
|
|
rows.append(d)
|
|
if skipped_null:
|
|
print(f" (NULL-safe: preskoceno {skipped_null} NULL-updated_at radku uz v jnj_messages)")
|
|
if args.limit:
|
|
rows = rows[:args.limit]
|
|
total = len(rows)
|
|
stats["total"] = total
|
|
print(f" Radku ke zpracovani: {total}")
|
|
if total == 0:
|
|
print(" Neni co synchronizovat (zadne nove radky).")
|
|
if not args.dry_run:
|
|
state_col.update_one({"_id": "watermark"},
|
|
{"$set": {"last_db": db_name, "synced_at": now}}, upsert=True)
|
|
return stats
|
|
|
|
# ── Indexy z Monga ────────────────────────────────────────────────────
|
|
print(" Nacitam _id + filename + jnj_folder z Mongo...")
|
|
ids_exact = set()
|
|
ids_norm = {}
|
|
fnames = {}
|
|
has_path = set()
|
|
for d in emails.find({}, {"_id": 1, "filename": 1, "jnj_folder": 1}):
|
|
_id = d["_id"]
|
|
ids_exact.add(_id)
|
|
ids_norm.setdefault(norm_mid(_id), _id)
|
|
fn = d.get("filename")
|
|
if fn:
|
|
fnames[fn] = _id
|
|
if d.get("jnj_folder"):
|
|
has_path.add(_id)
|
|
print(f" Mongo dokumentu v {EMAILS_COL}: {len(ids_exact)} (z toho s jnj_folder: {len(has_path)})")
|
|
|
|
# ── Plan ──────────────────────────────────────────────────────────────
|
|
m_exact = m_norm = m_fname = unmatched = 0
|
|
examples = []
|
|
mirror_ops = []
|
|
emaily_ops = []
|
|
max_wm = wm or ""
|
|
|
|
for r in rows:
|
|
mid = r.get("message_id")
|
|
uv = r.get("updated_at")
|
|
if uv and uv > max_wm:
|
|
max_wm = uv
|
|
|
|
# Krok A — zrcadlo (vzdy)
|
|
doc = {k: r.get(k) for k in ROW_COLS}
|
|
doc["mirrored_at"] = now
|
|
mirror_ops.append(UpdateOne({"_id": mid}, {"$set": doc}, upsert=True))
|
|
|
|
# Krok B — match do emaily
|
|
target = None
|
|
if mid in ids_exact:
|
|
target = mid; m_exact += 1
|
|
elif norm_mid(mid) in ids_norm:
|
|
target = ids_norm[norm_mid(mid)]; m_norm += 1
|
|
else:
|
|
eid = r.get("entry_id")
|
|
fn = (eid[-20:] + ".msg") if eid else None
|
|
if fn and fn in fnames:
|
|
target = fnames[fn]; m_fname += 1
|
|
else:
|
|
unmatched += 1
|
|
if len(examples) < 6:
|
|
examples.append(mid)
|
|
|
|
if target is not None:
|
|
setdoc = {
|
|
"jnj_folder": coalesce_path(r.get("jnj_folder"), r.get("folder")),
|
|
"jnj_is_read": bool(r.get("is_read")),
|
|
"jnj_not_in_mailbox": bool(r.get("not_in_mailbox_anymore")),
|
|
"jnj_left_mailbox_at": r.get("left_mailbox_at"),
|
|
"jnj_folder_synced_at": now,
|
|
}
|
|
emaily_ops.append(UpdateOne({"_id": target}, {"$set": setdoc}))
|
|
|
|
matched = m_exact + m_norm + m_fname
|
|
stats["matched"] = matched
|
|
print(" --- PLAN ---")
|
|
print(f" Zrcadlo -> {MIRROR_COL}: {len(mirror_ops)} upsert")
|
|
print(f" Emaily match exact (_id): {m_exact}")
|
|
print(f" Emaily match norm (<>): {m_norm}")
|
|
print(f" Emaily match filename: {m_fname}")
|
|
print(f" Emaily match CELKEM: {matched}/{total} ({100.0*matched/total:.1f}%)")
|
|
print(f" NEnamatchovano: {unmatched}")
|
|
if examples:
|
|
print(" Priklady nenamatchovanych message_id:")
|
|
for e in examples:
|
|
print(f" {str(e)[:72]}")
|
|
|
|
# ── Zapis ─────────────────────────────────────────────────────────────
|
|
if args.dry_run:
|
|
print(" DRY-RUN: Mongo se NEMENI.")
|
|
return stats
|
|
|
|
print(" Zapisuji...")
|
|
if mirror_ops:
|
|
db[MIRROR_COL].bulk_write(mirror_ops, ordered=False)
|
|
if emaily_ops:
|
|
emails.bulk_write(emaily_ops, ordered=False)
|
|
state_col.update_one(
|
|
{"_id": "watermark"},
|
|
{"$set": {"last_updated_at": max_wm, "synced_at": now, "last_db": db_name,
|
|
"last_total": total, "last_matched": matched}},
|
|
upsert=True,
|
|
)
|
|
print(f" SYNC hotovo: zrcadlo={len(mirror_ops)} emaily={len(emaily_ops)} watermark={max_wm}")
|
|
return stats
|
|
|
|
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
# FAZE 3 — ENRICH (Mongo -> PG fulltext, deleguje na sdileny 5_enrich)
|
|
# [drive jnj_emails_to_fulltext_v1.0.py]
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
|
|
def newest_enrich():
|
|
"""Najde nejnovejsi /scripts/5_enrich_fulltext_emails_v*.py podle verze vX.Y."""
|
|
cands = glob.glob(ENRICH_GLOB)
|
|
if not cands:
|
|
return None
|
|
|
|
def ver(p):
|
|
m = re.search(r"_v(\d+)\.(\d+)", os.path.basename(p))
|
|
return (int(m.group(1)), int(m.group(2))) if m else (0, 0)
|
|
|
|
return max(cands, key=ver)
|
|
|
|
|
|
def run_enrich(args, new_docs, force) -> dict:
|
|
"""FAZE 3: doindexuje JNJ schranku do PG fulltextu pres sdileny enrich.
|
|
Spousti se jen kdyz parse pridal nove dokumenty (nebo force/enrich-only)."""
|
|
stats = {"ran": False, "rc": None, "skipped_reason": None}
|
|
print("\n=== FAZE 3: ENRICH (PG fulltext) ===")
|
|
|
|
if args.no_enrich:
|
|
stats["skipped_reason"] = "--no-enrich"
|
|
print(" Preskoceno [--no-enrich].")
|
|
return stats
|
|
if args.dry_run:
|
|
enrich = newest_enrich()
|
|
stats["skipped_reason"] = "dry-run"
|
|
print(f" DRY-RUN: zavolal bych {enrich or '(enrich nenalezen!)'} --mailbox {EMAILS_COL}"
|
|
f" (nove doc z parse: {new_docs}, force={force})")
|
|
return stats
|
|
if not force and new_docs <= 0:
|
|
stats["skipped_reason"] = "zadne nove doc"
|
|
print(" Zadne nove maily z parse -> enrich preskocen "
|
|
"(JNJ stejne enrichuje pipeline v 6:00/18:00; --enrich-always vynuti).")
|
|
return stats
|
|
|
|
enrich = newest_enrich()
|
|
if not enrich:
|
|
stats["skipped_reason"] = "enrich skript nenalezen"
|
|
print(f" CHYBA: zadny enrich skript ({ENRICH_GLOB}) -> preskakuji.")
|
|
return stats
|
|
|
|
cmd = [sys.executable, enrich, "--mailbox", EMAILS_COL]
|
|
print(f" Spoustim: {' '.join(cmd)}")
|
|
sys.stdout.flush()
|
|
r = subprocess.run(cmd)
|
|
stats["ran"] = True
|
|
stats["rc"] = r.returncode
|
|
print(f" ENRICH hotovo: exit code {r.returncode}")
|
|
return stats
|
|
|
|
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
# FAZE RECONCILE — smaz provizorni duplikat (no-ID Sent kopie s ID-dvojcetem)
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
|
|
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+\-]+@[A-Za-z0-9.\-]+\.[A-Za-z]{2,}")
|
|
|
|
|
|
def _to_emails(s):
|
|
return frozenset(e.lower() for e in _EMAIL_RE.findall(s or ""))
|
|
|
|
|
|
def _subj_key(d):
|
|
return (d.get("normalized_subject") or d.get("subject") or "").strip().lower()
|
|
|
|
|
|
def _is_provisional_id(_id):
|
|
return isinstance(_id, str) and (_id.startswith("filename:") or _id.startswith("entryid:"))
|
|
|
|
|
|
def run_reconcile(db, args, now):
|
|
"""Smaze provizorni no-ID Sent kopie, ke kterym existuje dvojce s realnym
|
|
Message-ID (stejni 'to' prijemci + stejny subjekt + received_at do 24h).
|
|
Neodeslane (bez dvojcete) ponecha. --dry-run = jen plan, nic nemaze.
|
|
|
|
Match je na STABILNIM obsahu (emailove adresy + normalized_subject + cas),
|
|
NE na EntryID — provizorni a finalni kopie maji ruzny EntryID."""
|
|
stats = {"provisional": 0, "deletable": 0, "deleted": 0, "kept": 0}
|
|
print("\n=== FAZE RECONCILE (smaz provizorni duplikaty Sent bez Message-ID) ===")
|
|
emails = db[EMAILS_COL]
|
|
|
|
# 1) index dvojcat: realne-ID Sent dokumenty -> klic (to_emails, subj) -> [received_at]
|
|
twins = {}
|
|
for d in emails.find(
|
|
{"jnj_folder": {"$regex": "Sent Items"}},
|
|
{"_id": 1, "to": 1, "normalized_subject": 1, "subject": 1, "received_at": 1}):
|
|
if _is_provisional_id(d.get("_id")):
|
|
continue # jako dvojce berem jen dokumenty s realnym Message-ID
|
|
key = (_to_emails(d.get("to")), _subj_key(d))
|
|
if not key[0] or not key[1]:
|
|
continue
|
|
twins.setdefault(key, []).append(d.get("received_at"))
|
|
|
|
# 2) projdi provizorni a najdi dvojce v casovem okne 24h
|
|
WINDOW = 24 * 3600
|
|
to_delete = []
|
|
examples_keep = []
|
|
for p in emails.find(
|
|
{"jnj_folder": {"$regex": "Sent Items"},
|
|
"_id": {"$regex": "^(filename:|entryid:)"}},
|
|
{"_id": 1, "to": 1, "normalized_subject": 1, "subject": 1,
|
|
"received_at": 1, "send_failed": 1}):
|
|
stats["provisional"] += 1
|
|
key = (_to_emails(p.get("to")), _subj_key(p))
|
|
pr = p.get("received_at")
|
|
matched = False
|
|
if key[0] and key[1] and key in twins and pr is not None:
|
|
for tr in twins[key]:
|
|
if tr is None:
|
|
continue
|
|
try:
|
|
if abs((tr - pr).total_seconds()) <= WINDOW:
|
|
matched = True
|
|
break
|
|
except Exception:
|
|
continue
|
|
if matched:
|
|
stats["deletable"] += 1
|
|
to_delete.append((p["_id"], p.get("to")))
|
|
else:
|
|
stats["kept"] += 1
|
|
if p.get("send_failed") and len(examples_keep) < 8:
|
|
examples_keep.append(p.get("to"))
|
|
|
|
print(f" Provizornich (Sent bez Message-ID): {stats['provisional']}")
|
|
print(f" S nalezenym ID-dvojcetem (smazat): {stats['deletable']}")
|
|
print(f" Bez dvojcete (ponechat): {stats['kept']}")
|
|
if examples_keep:
|
|
print(" Priklady ponechanych s priznakem NEODESLANO:")
|
|
for to in examples_keep:
|
|
print(f" NEODESLANO | {to}")
|
|
|
|
if not to_delete:
|
|
print(" Nic ke smazani.")
|
|
return stats
|
|
|
|
if args.dry_run:
|
|
print(" DRY-RUN: NIC se nemaze. Ukazka kandidatu na smazani:")
|
|
for _id, to in to_delete[:15]:
|
|
print(f" - {_id} ({to})")
|
|
if len(to_delete) > 15:
|
|
print(f" ... a dalsich {len(to_delete) - 15}")
|
|
return stats
|
|
|
|
ids = [x[0] for x in to_delete]
|
|
res = emails.delete_many({"_id": {"$in": ids}})
|
|
stats["deleted"] = res.deleted_count
|
|
print(f" SMAZANO provizornich duplikatu: {stats['deleted']}")
|
|
return stats
|
|
|
|
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
# MAIN
|
|
# ══════════════════════════════════════════════════════════════════════════════
|
|
|
|
def main():
|
|
ap = argparse.ArgumentParser(description=f"jnj_tower_ingest v{SCRIPT_VERSION}")
|
|
ap.add_argument("--dry-run", action="store_true", help="nic nezapise, jen plan")
|
|
ap.add_argument("--full", action="store_true",
|
|
help="parse: reparsuj vse; sync: ignoruj watermark")
|
|
ap.add_argument("--limit", type=int, default=0, help="max N souboru/radku (test)")
|
|
ap.add_argument("--reindex", action="store_true", help="vynut indexy po parse")
|
|
ap.add_argument("--force", action="store_true",
|
|
help="sync: ignoruj last_db zkratku")
|
|
ap.add_argument("--parse-only", action="store_true", help="jen faze PARSE")
|
|
ap.add_argument("--sync-only", action="store_true", help="jen faze SYNC")
|
|
ap.add_argument("--enrich-only", action="store_true", help="jen faze ENRICH")
|
|
ap.add_argument("--no-enrich", action="store_true", help="preskoc fazi ENRICH")
|
|
ap.add_argument("--enrich-always", action="store_true",
|
|
help="spust enrich i bez novych dokumentu z parse")
|
|
ap.add_argument("--reconcile", action="store_true",
|
|
help="spust fazi RECONCILE (smaz provizorni Sent duplikaty; "
|
|
"s --dry-run jen plan)")
|
|
args = ap.parse_args()
|
|
|
|
now = datetime.now(timezone.utc).replace(tzinfo=None)
|
|
|
|
print(f"=== jnj_tower_ingest v{SCRIPT_VERSION} {'[DRY-RUN]' if args.dry_run else ''} ===")
|
|
print(f"Start: {datetime.now():%Y-%m-%d %H:%M:%S}")
|
|
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}")
|
|
|
|
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
|
|
try:
|
|
client.admin.command("ping")
|
|
print(" MongoDB OK")
|
|
except Exception as e:
|
|
print(f"CHYBA: MongoDB nedostupna -- {e}")
|
|
sys.exit(1)
|
|
|
|
db = client[MONGO_DB]
|
|
col = db[EMAILS_COL]
|
|
state_col = db[STATE_COL]
|
|
|
|
p_stats = s_stats = e_stats = r_stats = None
|
|
if not args.sync_only and not args.enrich_only:
|
|
p_stats = run_parse(col, state_col, args, now)
|
|
if not args.parse_only and not args.enrich_only:
|
|
s_stats = run_sync(db, args, now)
|
|
# RECONCILE bezi jen na vyzadani (--reconcile); potrebuje jnj_folder ze sync.
|
|
if args.reconcile and not args.parse_only and not args.enrich_only:
|
|
r_stats = run_reconcile(db, args, now)
|
|
if not args.parse_only and not args.sync_only:
|
|
new_docs = p_stats["ok"] if p_stats else 0
|
|
force = args.enrich_only or args.enrich_always or args.full
|
|
e_stats = run_enrich(args, new_docs, force)
|
|
|
|
# ── Souhrn ────────────────────────────────────────────────────────────
|
|
print("\n=== SOUHRN ===")
|
|
if p_stats is not None:
|
|
print(f" PARSE: rezim={p_stats['mode']} kandidatu={p_stats['candidates']} "
|
|
f"ok={p_stats['ok']} err={p_stats['err']}")
|
|
if s_stats is not None:
|
|
if s_stats.get("skipped"):
|
|
print(" SYNC: preskoceno (zadna nova DB / uz zpracovana)")
|
|
else:
|
|
print(f" SYNC: radku={s_stats['total']} match={s_stats['matched']}")
|
|
if r_stats is not None:
|
|
akce = "plan" if args.dry_run else f"smazano={r_stats['deleted']}"
|
|
print(f" RECON: provizornich={r_stats['provisional']} "
|
|
f"smazatelnych={r_stats['deletable']} {akce}")
|
|
if e_stats is not None:
|
|
if e_stats.get("ran"):
|
|
print(f" ENRICH: spusten, exit code {e_stats['rc']}")
|
|
else:
|
|
print(f" ENRICH: preskoceno ({e_stats.get('skipped_reason')})")
|
|
print(f"Konec: {datetime.now():%Y-%m-%d %H:%M:%S}")
|
|
client.close()
|
|
|
|
|
|
if __name__ == "__main__":
|
|
main()
|