2026-06-10 11:59:19 +02:00
parent a41f97b86b
commit 7b2f69ad85
275 changed files with 16726 additions and 0 deletions
@@ -0,0 +1,86 @@
# jnj_emails_to_fulltext_v1.0
**Version:** 1.0
**Date:** 2026-06-10
**Author:** vladimir.buzalka
**Location:** `/mnt/user/Scripts/Janssen_emails_to_fulltext/` (inside the container: `/scripts/Janssen_emails_to_fulltext/`)
## Purpose
Incremental processing of JNJ e-mails. The mailbox `vbuzalka@its.jnj.com` can only be
accessed through a continuous export of `.msg` files into `/mnt/JNJEMAILS` (about 70,000
files, with new ones arriving daily). For each new file the script runs two steps:
1. **STEP 1: IMPORT.** Parses the file and stores it in MongoDB `emaily."vbuzalka@its.jnj.com"`
   (same schema and `_id` logic as the bulk import `parse_emails_tower_v1.3.py`).
2. **STEP 2: ENRICH.** Writes the full text to PostgreSQL `MongoEmaily.emails`. This step is
   delegated to the existing `/scripts/5_enrich_fulltext_emails_v1.4.py --mailbox "vbuzalka@its.jnj.com"`,
   so the PG schema, `extractor_version`, and skip logic stay identical to the main
   Graph pipeline (step 5 in `0_run_pipeline`). The main pipeline then simply skips
   these records (matching ext_v, ok=true), so no work is done twice.
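
The `_id` logic mentioned in step 1 is described in the script as "Internet Message-ID, fallback `filename:<stem>`". A minimal sketch of that rule follows; `make_doc_id` is a hypothetical helper name used only for illustration (the script performs this inline), and the file names are made up.

```python
from pathlib import Path
from typing import Optional


def make_doc_id(message_id: Optional[str], msg_path: Path) -> str:
    """Return the Mongo _id: the Message-ID if present, else 'filename:<stem>'."""
    # Hypothetical helper for illustration only; not a function from the script.
    if message_id and message_id.strip():
        return message_id.strip()
    return f"filename:{msg_path.stem}"


# Two exported files may carry the same Message-ID and then map to one document.
a = make_doc_id("<abc@its.jnj.com>", Path("/mnt/JNJEMAILS/mail_001.msg"))
b = make_doc_id("<abc@its.jnj.com>", Path("/mnt/JNJEMAILS/mail_002.msg"))
c = make_doc_id(None, Path("/mnt/JNJEMAILS/mail_003.msg"))
print(a == b, c)
```

Because both files resolve to the same `_id`, an upsert keeps only one document, which is why file names alone cannot tell the script that both files were already handled.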
## Relationship to existing scripts
| Script | Role |
|--------|------|
| `parse_emails_tower_v1.3.py` | one-off bulk import (70k files, ~48 h); source of the parsing logic |
| `1b_parse_emails_graph_delta_v1.0.py` | Graph delta for the buzalka.cz mailboxes; JNJ intentionally left out |
| `5_enrich_fulltext_emails_v1.4.py` | enrich into PG; **this script calls it with `--mailbox`** |
| **`jnj_emails_to_fulltext_v1.0.py`** | **incremental .msg → Mongo → PG for JNJ** |

Once it has proven itself in operation, the plan is to merge it into the main pipeline
`0_run_pipeline` as an additional step (e.g. "1c: JNJ msg import").
## Incremental behavior (what gets skipped)
- files recorded in `state.json` under the `done` key, `{filename: message_id}`
- safety net: `distinct("filename")` from the Mongo collection, so files already imported
  by the bulk import are skipped even while `state.json` is still empty;
  `state.json` also handles duplicate Message-IDs (2 files → 1 document), which would
  otherwise be re-imported forever if only the Mongo distinct were used
- files younger than `--min-age` seconds (default 300), because the export may still be writing them
- files that have failed 3 times (`MAX_FAIL`); `--retry-failed` tries them again
- flock (`.lock`): a concurrent start exits without doing any work
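
The skip rules above can be sketched as one pure function. This is a simplified illustration, not the script's code: `select_candidates` and its parameters are hypothetical names, and files are passed in as `(name, mtime)` pairs so the example runs without touching a disk or MongoDB.

```python
from typing import Dict, Iterable, List, Set, Tuple

MAX_FAIL = 3  # same threshold as listed above


def select_candidates(
    files: Iterable[Tuple[str, float]],  # (filename, mtime in epoch seconds)
    done: Dict[str, str],                # "done" section of state.json
    failed: Dict[str, int],              # "failed" section of state.json
    mongo_filenames: Set[str],           # result of distinct("filename")
    now: float,
    min_age: int = 300,
    retry_failed: bool = False,
) -> List[str]:
    """Return the file names that would be imported in this run."""
    known = set(done) | mongo_filenames
    failed_skip = set() if retry_failed else {
        fn for fn, count in failed.items() if count >= MAX_FAIL
    }
    picked = []
    for name, mtime in sorted(files):
        if name in known or name in failed_skip:
            continue  # already imported, or failing permanently
        if now - mtime < min_age:
            continue  # too young, the export may still be writing it
        picked.append(name)
    return picked


files = [("a.msg", 0.0), ("b.msg", 0.0), ("c.msg", 990.0), ("d.msg", 0.0), ("e.msg", 0.0)]
picked = select_candidates(
    files, done={"a.msg": "<1@x>"}, failed={"d.msg": 3},
    mongo_filenames={"b.msg"}, now=1000.0,
)
print(picked)
```

Keeping the selection free of I/O makes the rules easy to check in isolation; the lock is deliberately left out because it guards the whole run rather than individual files.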
## Running
```bash
# Preview without writing (Mongo, PG, and state stay untouched):
docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py --dry-run --limit 10
# Live incremental run:
docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py
# Wrapper with a dated log (for User Scripts / cron):
/mnt/user/Scripts/Janssen_emails_to_fulltext/run_jnj_emails_to_fulltext.sh
```
| Parameter | Meaning |
|-----------|---------|
| `--dry-run` | parses but writes nothing (Mongo, PG, state.json) |
| `--limit N` | at most N new files (for testing) |
| `--min-age S` | skip files younger than S seconds (default 300) |
| `--no-enrich` | import into Mongo only, no PG |
| `--retry-failed` | retry files that keep failing |
| `--msgs-dir DIR` | different source directory (default `/mnt/JNJEMAILS`) |
## Outputs and logs (all in the script directory)
- `logs/run_YYYYMMDD_HHMM.log`: stdout of the run (via the wrapper; `run_latest.log` is a symlink)
- `logs/errors.log`: parsing errors for individual files
- `state.json`: `{done: {filename: message_id}, failed: {filename: count}}`
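
To inspect `state.json` by hand, a small reader along these lines may help. It is a sketch under the documented layout (`done` and `failed` sections); `summarize_state` is a hypothetical helper, and the sample data is invented.

```python
import json
import tempfile
from pathlib import Path

MAX_FAIL = 3  # threshold after which a file is skipped


def summarize_state(state_path: Path) -> dict:
    """Summarize a state.json with the layout {done: {...}, failed: {...}}."""
    state = json.loads(state_path.read_text(encoding="utf-8"))
    done = state.get("done", {})
    failed = state.get("failed", {})
    return {
        "done_files": len(done),
        # Fewer distinct IDs than files means some files share a Message-ID.
        "distinct_message_ids": len(set(done.values())),
        "permanently_failing": sorted(fn for fn, n in failed.items() if n >= MAX_FAIL),
    }


# Invented sample data written to a temporary file for the demonstration.
sample = {
    "done": {"a.msg": "<1@x>", "b.msg": "<1@x>", "c.msg": "filename:c"},
    "failed": {"bad.msg": 3, "flaky.msg": 1},
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "state.json"
    path.write_text(json.dumps(sample), encoding="utf-8")
    summary = summarize_state(path)
print(summary)
```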
## Exit codes
- `0`: OK (including "nothing new", and an exit because another run holds the lock)
- `1`: parsing errors / enrich failed / MongoDB unavailable
- `130`: interrupted by the user (Ctrl+C)
## Dependencies
The `python-runner` image already includes `extract-msg==0.55.0`, `olefile`, `pymongo`,
and `python-dateutil`. STEP 2 additionally needs `psycopg`, `bs4`, and `lxml`; pipeline
step 5 uses them daily, so they are available.
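
If the image is ever rebuilt, a quick check such as the following could confirm that the modules are importable. This is only a sketch: `missing_modules` is a hypothetical helper, and the mapping from package name to import name is an assumption based on the usual names of these packages.

```python
import importlib.util

# Package name -> top-level import name (assumed mapping).
REQUIRED = {
    "extract-msg": "extract_msg",
    "olefile": "olefile",
    "pymongo": "pymongo",
    "python-dateutil": "dateutil",
    "psycopg": "psycopg",
    "bs4": "bs4",
    "lxml": "lxml",
}


def missing_modules(required: dict) -> list:
    """Return the package names whose import module cannot be found."""
    return sorted(
        pkg for pkg, module in required.items()
        if importlib.util.find_spec(module) is None
    )


print(missing_modules(REQUIRED))
```

Running it inside the container, for example through `docker exec python-runner python`, should print an empty list when everything is installed.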
## Version history
- **1.0** (2026-06-10): initial version; parsing logic taken 1:1 from
  `parse_emails_tower_v1.3.py`, enrich delegated to `5_enrich_fulltext_emails_v1.4.py`
@@ -0,0 +1,890 @@
"""
jnj_emails_to_fulltext_v1.0.py

Name:    jnj_emails_to_fulltext_v1.0.py
Version: 1.0
Date:    2026-06-10
Author:  vladimir.buzalka

Description:
    Incremental pipeline for JNJ e-mails exported as .msg files.
    New .msg files arrive in /mnt/JNJEMAILS (export from Outlook; there is
    no other access to the mailbox vbuzalka@its.jnj.com). The script has
    two steps:
      STEP 1 (IMPORT): new .msg -> MongoDB emaily."vbuzalka@its.jnj.com"
      STEP 2 (ENRICH): full text -> PostgreSQL MongoEmaily.emails
                       (calls the existing /scripts/5_enrich_fulltext_emails_v1.4.py
                       with the --mailbox parameter, so the PG schema and
                       extractor_version stay 100% consistent with the main
                       Graph pipeline)

    The parsing logic of STEP 1 is taken 1:1 from parse_emails_tower_v1.3.py:
      - cascading open (normal -> SUPPRESS_ALL -> overrideEncoding)
      - raw-OLE fallback for degraded text fields (the encoding is not trusted)
      - to_bson protection against MAPI values beyond int64
      - same document schema, same collection, same way of building _id
        (Internet Message-ID, fallback "filename:<stem>")

    Incremental behavior (what gets skipped):
      - files in the state file state.json (key "done")
      - plus a safety net: distinct("filename") from the Mongo collection
        (covers files from the bulk import while state.json is still empty)
      - files younger than --min-age seconds (they may still be being written)
      - files that have failed MAX_FAIL times (--retry-failed tries them again)

    The state file state.json also handles the edge case of duplicate
    Message-IDs (2 different .msg files with the same _id would otherwise be
    re-imported alternately forever via Mongo distinct("filename")).

Environment:
    Runs in the Docker container "python-runner" on Unraid Tower.
      /mnt/user/JNJEMAILS -> /mnt/JNJEMAILS   (source .msg files)
      /mnt/user/Scripts   -> /scripts         (this script + the enrich script)
    MongoDB    192.168.1.76:27017  db=emaily       collection=vbuzalka@its.jnj.com
    PostgreSQL 192.168.1.76:5432   db=MongoEmaily  table=emails (via enrich)

Running (from the Unraid terminal):
    # Preview without writing (parses but writes to neither Mongo nor PG):
    docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py --dry-run --limit 10
    # Live incremental run (import + enrich):
    docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py
    # Via the wrapper with a dated log (for cron):
    /mnt/user/Scripts/Janssen_emails_to_fulltext/run_jnj_emails_to_fulltext.sh

Parameters:
    --dry-run        parsing runs, but NOTHING is written (Mongo, PG, state)
    --limit N        process at most N new files (for testing)
    --min-age S      skip files younger than S seconds (default 300)
    --no-enrich      skip STEP 2 (import into Mongo only)
    --retry-failed   also retry files that have already failed MAX_FAIL times
    --msgs-dir DIR   different source directory (default /mnt/JNJEMAILS)

Output / logs (all in the script directory):
    stdout           progress (the wrapper redirects it to logs/run_*.log)
    logs/errors.log  parsing errors for individual files
    state.json       state: done={filename: message_id}, failed={filename: count}

Exit codes:
    0 = OK (including "nothing new")
    1 = error (parsing errors / enrich failed / MongoDB unavailable)

Version history:
    1.0  2026-06-10  Initial version (fork of the parsing logic of parse_emails_tower_v1.3,
                     enrich delegated to 5_enrich_fulltext_emails_v1.4 --mailbox)
"""
import sys
import os
import re
import json
import time
import logging
import argparse
import base64
import struct
import subprocess
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import extract_msg
from extract_msg.enums import ErrorBehavior
import olefile
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne

try:
    import fcntl  # Linux only (always present in the container)
except ImportError:  # local development on Windows
    fcntl = None

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# ─── CONFIGURATION ────────────────────────────────────────────────────────────
SCRIPT_DIR = Path(__file__).resolve().parent
MSGS_DIR = Path("/mnt/JNJEMAILS")
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "vbuzalka@its.jnj.com"
ENRICH_SCRIPT = Path("/scripts/5_enrich_fulltext_emails_v1.4.py")
BATCH_SIZE = 200
MAX_FAIL = 3  # skip a file after this many failures
DEFAULT_MIN_AGE = 300  # seconds; younger files may still be being written
STATE_FILE = SCRIPT_DIR / "state.json"
LOCK_FILE = SCRIPT_DIR / ".lock"
LOG_DIR = SCRIPT_DIR / "logs"
ERR_LOG = LOG_DIR / "errors.log"
SCRIPT_VERSION = "1.0"
# ──────────────────────────────────────────────────────────────────────────────

LOG_DIR.mkdir(exist_ok=True)
logging.basicConfig(
    filename=str(ERR_LOG),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)


# ─── Helper functions (1:1 from parse_emails_tower_v1.3) ─────────────────────
def safe(obj, *attrs, default=None):
    """Safely read attributes: return the first non-None, non-blank value."""
    for attr in attrs:
        try:
            val = getattr(obj, attr, None)
            if val is None:
                continue
            if isinstance(val, str) and not val.strip():
                continue
            return val
        except Exception:
            continue
    return default


def parse_date(raw) -> Optional[datetime]:
    """Any date value -> UTC datetime without tzinfo (for MongoDB)."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


_INT64_MIN, _INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def to_bson(val):
    """Convert a value to a BSON-serializable type.

    BSON only supports signed int64. Large MAPI values (PR_CHANGE_KEY, FILETIME)
    outside that range are converted to strings, otherwise the whole
    bulk_write fails.
    """
    if isinstance(val, bool):  # bool BEFORE int (isinstance(True, int) is True)
        return val
    if isinstance(val, bytes):
        return val.hex() if len(val) <= 128 else f"<bytes:{len(val)}>"
    if isinstance(val, datetime):
        return parse_date(val)
    if isinstance(val, int):
        return val if _INT64_MIN <= val <= _INT64_MAX else str(val)
    if isinstance(val, (str, float, type(None))):
        return val
    if isinstance(val, list):
        return [to_bson(v) for v in val]
    try:
        iv = int(val)
        return iv if _INT64_MIN <= iv <= _INT64_MAX else str(iv)
    except Exception:
        pass
    return str(val)


def extract_headers(msg) -> dict:
    headers = {}
    try:
        hdr = msg.header
        if not hdr:
            return {}
        from email.header import decode_header as _dh

        def _decode(v: str) -> str:
            try:
                parts = _dh(v)
                out = ""
                for part, enc in parts:
                    out += part.decode(enc or "utf-8", errors="replace") if isinstance(part, bytes) else part
                return out
            except Exception:
                return v

        for key in set(hdr.keys()):
            k = key.lower().replace("-", "_")
            vals = [_decode(v) for v in hdr.get_all(key, [])]
            headers[k] = vals if len(vals) > 1 else (vals[0] if vals else "")
    except Exception as e:
        logging.error("extract_headers: %s", e)
    return headers


def extract_recipients(msg) -> list:
    result = []
    type_map = {1: "to", 2: "cc", 3: "bcc"}
    try:
        for r in msg.recipients:
            rtype = getattr(r, "type", 1)
            try:
                rtype = int(rtype)
            except Exception:
                try:
                    rtype = int(rtype.value)
                except Exception:
                    rtype = 1
            rec = {
                "type": type_map.get(rtype, "to"),
                "email": safe(r, "email", default=""),
                "name": safe(r, "name", default=""),
            }
            result.append(rec)
    except Exception as e:
        logging.error("extract_recipients: %s", e)
    return result


def extract_attachments(msg) -> list:
    result = []
    try:
        for att in msg.attachments:
            fname = safe(att, "longFilename", "shortFilename", default="")
            if not fname:
                continue
            size = 0
            try:
                d = att.data
                size = len(d) if d else 0
            except Exception:
                pass
            result.append({
                "filename": fname,
                "size_bytes": size,
                "mime_type": safe(att, "mimetype", "mimeType", default="application/octet-stream"),
                "content_id": safe(att, "cid", default=None),
                "is_inline": bool(safe(att, "isInline", default=False)),
            })
    except Exception as e:
        logging.error("extract_attachments: %s", e)
    return result


def extract_mapi_props(msg) -> dict:
    """All raw MAPI properties as {0xXXXX: value}."""
    result = {}
    try:
        props = msg.props
        if not hasattr(props, "items"):
            return {}
        for key, prop in props.items():
            try:
                val = to_bson(prop.value)
                prop_id = f"0x{key[:4].upper()}" if len(key) >= 4 else f"0x{key.upper()}"
                result[prop_id] = val
            except Exception:
                pass
    except Exception as e:
        logging.error("extract_mapi_props: %s", e)
    return result


# ─── Tolerant opening and raw-OLE fallback (1:1 from v1.3) ───────────────────
_CPID_TO_CODEC = {
    1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
    1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257",
    1258: "cp1258", 874: "cp874", 932: "shift_jis", 936: "gb2312",
    949: "euc_kr", 950: "big5", 65001: "utf-8", 28591: "iso-8859-1",
    28592: "iso-8859-2", 20127: "ascii",
}


def _read_u32_prop(ole, propid):
    """Read a 32-bit MAPI property value from the top-level __properties_version1.0."""
    try:
        data = ole.openstream("__properties_version1.0").read()
    except Exception:
        return None
    body = data[32:]  # skip the 32-byte header, then 16-byte records follow
    for i in range(0, len(body) - 16 + 1, 16):
        rec = body[i:i + 16]
        tag = struct.unpack("<I", rec[0:4])[0]
        if ((tag >> 16) & 0xFFFF) == propid:
            return struct.unpack("<I", rec[8:12])[0]
    return None


def _detect_cpid(ole) -> Optional[str]:
    """Codec from PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE (a hint, not a guarantee)."""
    for pid in (0x3FDE, 0x3FFD):
        codec = _CPID_TO_CODEC.get(_read_u32_prop(ole, pid))
        if codec and codec not in ("utf-8", "ascii"):
            return codec
    return None


def _cascade_decode(raw: bytes, is_unicode: bool, cpid_codec: Optional[str]) -> str:
    """Decode the bytes of a MAPI string; headers are not trusted, so strict decodes are tried in turn."""
    if not raw:
        return ""
    if is_unicode:
        try:
            return raw.decode("utf-16-le")
        except Exception:
            return raw.decode("utf-16-le", errors="replace")
    order = ["utf-8"]
    if cpid_codec:
        order.append(cpid_codec)
    order += ["cp1250", "cp1252", "gb2312", "big5"]
    for enc in order:
        try:
            return raw.decode(enc, errors="strict")
        except Exception:
            continue
    return raw.decode("latin-1", errors="replace")


def _raw_mapi_strings(msg_path: Path) -> dict:
    """Read the key MAPI text fields DIRECTLY from OLE (bypassing extract_msg)."""
    out = {"subject": "", "normalized_subject": "", "sender_name": "",
           "sender_email": "", "sender_smtp": "", "body_text": "", "body_html": ""}
    try:
        ole = olefile.OleFileIO(str(msg_path))
    except Exception:
        return out
    try:
        cpid = _detect_cpid(ole)
        wanted = {
            "0037": "subject", "0E1D": "normalized_subject",
            "0C1A": "sender_name", "5D01": "sender_smtp",
            "0C1F": "sender_email", "1000": "body_text", "1013": "body_html",
        }
        prefix = "__substg1.0_"
        found = {}
        for entry in ole.listdir():
            if len(entry) != 1:
                continue
            name = entry[0]
            if not name.startswith(prefix):
                continue
            tag = name[len(prefix):len(prefix) + 4].upper()
            key = wanted.get(tag)
            if not key:
                continue
            typ = name[-4:].upper()
            # Prefer Unicode (001F) over 8-bit string (001E) over binary (0102).
            prio = {"001F": 3, "001E": 2, "0102": 1}.get(typ, 0)
            if prio == 0:
                continue
            prev = found.get(key)
            if prev and prev[0] >= prio:
                continue
            try:
                raw = ole.openstream(entry).read()
                val = _cascade_decode(raw, typ == "001F", cpid)
            except Exception:
                continue
            found[key] = (prio, val)
        for key, (_, val) in found.items():
            out[key] = val
    finally:
        ole.close()
    return out


def _degraded(s) -> bool:
    """A field is degraded when it is empty or contains U+FFFD."""
    return (not s) or ("\ufffd" in s)


def open_message(msg_path: Path):
    """Cascading open of a .msg file -> (msg, mode) or (None, None)."""
    try:
        return extract_msg.Message(str(msg_path)), "normal"
    except Exception:
        pass
    try:
        return extract_msg.Message(
            str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL), "suppress_all"
    except Exception:
        pass
    encs = []
    try:
        ole = olefile.OleFileIO(str(msg_path))
        c = _detect_cpid(ole)
        ole.close()
        if c:
            encs.append(c)
    except Exception:
        pass
    for e in encs + ["cp1250", "cp1252"]:
        try:
            return extract_msg.Message(
                str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL,
                overrideEncoding=e), f"override:{e}"
        except Exception:
            continue
    return None, None


# ─── Main extraction (1:1 from v1.3) ─────────────────────────────────────────
def extract_message(msg_path: Path) -> Optional[dict]:
    """Parse one .msg file -> MongoDB document."""
    msg, parse_mode = open_message(msg_path)
    if msg is None:
        logging.error("open failed [%s]: all attempts to open the file failed", msg_path.name)
        return None
    try:
        mid = None
        for attr in ("messageId", "message_id", "internetMessageId"):
            mid = safe(msg, attr)
            if mid:
                break
        if not mid:
            mid = f"filename:{msg_path.stem}"
        mid = str(mid).strip()

        try:
            subject = msg.subject or ""
        except Exception:
            subject = ""
        normalized_subject = safe(msg, "normalizedSubject", "normalized_subject", default="")

        try:
            body_text = msg.body or ""
        except Exception:
            body_text = ""
        body_html = None
        try:
            bh = msg.htmlBody
            if isinstance(bh, bytes):
                bh = bh.decode("utf-8", errors="replace")
            if bh:
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
        except Exception:
            pass

        try:
            sender_email = msg.sender or ""
        except Exception:
            sender_email = ""
        sender_name = safe(msg, "senderName", "sender_name", default="")
        sender_smtp = safe(msg, "senderSmtpAddress", "sent_representing_smtp_address", default="")

        recipients = extract_recipients(msg)
        try:
            to_raw = msg.to or ""
        except Exception:
            to_raw = ""
        try:
            cc_raw = msg.cc or ""
        except Exception:
            cc_raw = ""
        try:
            bcc_raw = getattr(msg, "bcc", None) or ""
        except Exception:
            bcc_raw = ""
        display_to = safe(msg, "displayTo", "display_to", default="")
        display_cc = safe(msg, "displayCc", "display_cc", default="")

        try:
            received_at = parse_date(msg.date)
        except Exception:
            received_at = None
        sent_at = None
        for attr in ("clientSubmitTime", "client_submit_time", "sentOn"):
            v = safe(msg, attr)
            if v:
                sent_at = parse_date(v)
                break

        importance = 1
        try:
            v = msg.importance
            if v is not None:
                importance = int(v)
        except Exception:
            pass
        sensitivity = 0
        try:
            v = getattr(msg, "sensitivity", None)
            if v is not None:
                sensitivity = int(v)
        except Exception:
            pass
        flag_status = 0
        try:
            v = safe(msg, "flagStatus", "flag_status")
            if v is not None:
                flag_status = int(v)
        except Exception:
            pass

        conversation_topic = safe(msg, "conversationTopic", "conversation_topic", default="")
        conversation_index = ""
        try:
            ci = safe(msg, "conversationIndex", "conversation_index")
            if isinstance(ci, bytes):
                conversation_index = base64.b64encode(ci).decode()
            elif ci:
                conversation_index = str(ci)
        except Exception:
            pass

        in_reply_to = safe(msg, "inReplyTo", "in_reply_to", default="")
        internet_refs = []
        try:
            refs = safe(msg, "internetReferences", "internet_references")
            if isinstance(refs, list):
                internet_refs = refs
            elif isinstance(refs, str) and refs:
                internet_refs = [r.strip() for r in refs.split() if r.strip()]
        except Exception:
            pass

        categories = []
        try:
            cats = safe(msg, "categories")
            if isinstance(cats, list):
                categories = [str(c) for c in cats if c]
            elif isinstance(cats, str) and cats:
                categories = [c.strip() for c in re.split(r"[;,]", cats) if c.strip()]
        except Exception:
            pass

        read_receipt = bool(safe(msg, "readReceiptRequested", "read_receipt_requested", default=False))
        delivery_receipt = bool(safe(msg, "deliveryReceiptRequested", "delivery_receipt_requested", default=False))

        headers = extract_headers(msg)
        if not in_reply_to:
            in_reply_to = headers.get("in_reply_to", "")
        if not internet_refs:
            refs_str = headers.get("references", "")
            if isinstance(refs_str, str) and refs_str:
                internet_refs = [r.strip() for r in refs_str.split() if r.strip()]

        attachments = extract_attachments(msg)
        mapi_raw = extract_mapi_props(msg)
        msg.close()

        # Raw-OLE fallback for degraded text fields (empty or containing U+FFFD)
        parse_degraded = parse_mode != "normal"
        forced = parse_mode != "normal"
        if (forced or _degraded(subject) or _degraded(body_text)
                or _degraded(sender_email) or (body_html and "\ufffd" in body_html)):
            raw = _raw_mapi_strings(msg_path)
            if raw["subject"] and (forced or _degraded(subject)):
                subject = raw["subject"]
            if raw["normalized_subject"] and (forced or _degraded(normalized_subject)):
                normalized_subject = raw["normalized_subject"]
            if raw["body_text"] and (forced or _degraded(body_text)):
                body_text = raw["body_text"]
            if raw["body_html"] and (forced or not body_html or "\ufffd" in body_html):
                bh = raw["body_html"]
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
            if (raw["sender_smtp"] or raw["sender_email"]) and (forced or _degraded(sender_email)):
                sender_email = raw["sender_smtp"] or raw["sender_email"]
            if raw["sender_name"] and (forced or _degraded(sender_name)):
                sender_name = raw["sender_name"]
            if raw["sender_smtp"] and not sender_smtp:
                sender_smtp = raw["sender_smtp"]

        return {
            "_id": mid,
            "filename": msg_path.name,
            "subject": subject,
            "normalized_subject": normalized_subject,
            "importance": importance,
            "sensitivity": sensitivity,
            "flag_status": flag_status,
            "read_receipt_requested": read_receipt,
            "delivery_receipt_requested": delivery_receipt,
            "has_attachments": len(attachments) > 0,
            "attachment_count": len(attachments),
            "message_size_bytes": msg_path.stat().st_size,
            "conversation_topic": conversation_topic,
            "conversation_index": conversation_index,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "categories": categories,
            "received_at": received_at,
            "sent_at": sent_at,
            "sender": {
                "email": sender_email,
                "name": sender_name,
                "smtp": sender_smtp,
            },
            "to": to_raw,
            "cc": cc_raw,
            "bcc": bcc_raw,
            "display_to": display_to,
            "display_cc": display_cc,
            "recipients": recipients,
            "body_text": body_text,
            "body_html": body_html,
            "attachments": attachments,
            "headers": headers,
            "mapi": mapi_raw,
            "parse_mode": parse_mode,
            "parse_degraded": parse_degraded,
            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }
    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg_path.name, e)
        return None


# ─── State (state.json) ──────────────────────────────────────────────────────
def load_state() -> dict:
    if STATE_FILE.exists():
        try:
            st = json.loads(STATE_FILE.read_text(encoding="utf-8"))
            if isinstance(st, dict) and "done" in st and "failed" in st:
                return st
        except Exception as e:
            print(f"  WARNING: could not load state.json ({e}) -- starting with an empty state")
    return {"done": {}, "failed": {}}


def save_state(state: dict) -> None:
    # Write to a temporary file first, then replace atomically.
    tmp = STATE_FILE.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, ensure_ascii=False), encoding="utf-8")
    os.replace(tmp, STATE_FILE)


# ─── Lock against concurrent runs ────────────────────────────────────────────
def acquire_lock():
    """Return the open lock file handle, or exit the script if another run is in progress."""
    if fcntl is None:
        return None
    lf = open(LOCK_FILE, "w")
    try:
        fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        print("Another jnj_emails_to_fulltext run is still in progress (lock) -- exiting.")
        sys.exit(0)
    return lf


# ─── STEP 2: enrich (delegated to the existing pipeline script) ──────────────
def run_enrich() -> int:
    """Run 5_enrich_fulltext_emails --mailbox <JNJ collection>. Returns the exit code."""
    if not ENRICH_SCRIPT.exists():
        print(f"  ERROR: enrich script not found: {ENRICH_SCRIPT}")
        return 1
    cmd = [sys.executable, str(ENRICH_SCRIPT), "--mailbox", MONGO_COL]
    print("\n=== STEP 2: ENRICH (PG fulltext) ===")
    print(f"  {' '.join(cmd)}")
    sys.stdout.flush()
    r = subprocess.run(cmd)
    print(f"  enrich exit code: {r.returncode}")
    return r.returncode


# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main() -> int:
    ap = argparse.ArgumentParser(description=f"jnj_emails_to_fulltext v{SCRIPT_VERSION}")
    ap.add_argument("--msgs-dir", default=str(MSGS_DIR),
                    help="Path to the .msg files")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N new files (0 = all)")
    ap.add_argument("--min-age", type=int, default=DEFAULT_MIN_AGE,
                    help=f"Skip files younger than S seconds (default {DEFAULT_MIN_AGE})")
    ap.add_argument("--dry-run", action="store_true",
                    help="Parse but do NOT write (Mongo, PG, state.json)")
    ap.add_argument("--no-enrich", action="store_true",
                    help="Skip STEP 2 (PG fulltext)")
    ap.add_argument("--retry-failed", action="store_true",
                    help=f"Also retry files with {MAX_FAIL}+ failures")
    args = ap.parse_args()

    msgs_dir = Path(args.msgs_dir)
    start = datetime.now()
    print(f"=== jnj_emails_to_fulltext v{SCRIPT_VERSION} ===")
    print(f"Start:   {start.strftime('%Y-%m-%d %H:%M:%S')}{'  [DRY-RUN]' if args.dry_run else ''}")
    print(f"Source:  {msgs_dir}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")

    lock = acquire_lock()  # noqa: F841 (the handle is kept open until the run ends)

    # MongoDB
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print("  MongoDB OK")
    except Exception as e:
        print(f"  ERROR: MongoDB is not available -- {e}")
        return 1
    col = client[MONGO_DB][MONGO_COL]

    # State + safety net from Mongo (every document from the tower import has a filename)
    state = load_state()
    print("  Loading the list of already imported files from MongoDB...")
    mongo_filenames = set(col.distinct("filename"))
    print(f"  in Mongo: {len(mongo_filenames)} files, in state.json: {len(state['done'])}")
    known = set(state["done"]) | mongo_filenames
    failed_skip = set()
    if not args.retry_failed:
        failed_skip = {fn for fn, cnt in state["failed"].items() if cnt >= MAX_FAIL}

    # Scan
    print(f"\nScanning {msgs_dir} ...")
    all_files = sorted(msgs_dir.glob("*.msg"))
    now_ts = time.time()
    too_young = 0
    candidates = []
    for f in all_files:
        if f.name in known or f.name in failed_skip:
            continue
        try:
            if now_ts - f.stat().st_mtime < args.min_age:
                too_young += 1
                continue
        except OSError:
            continue
        candidates.append(f)
    if args.limit:
        candidates = candidates[:args.limit]
    total = len(candidates)
    print(f"  Total .msg on disk:      {len(all_files)}")
    print(f"  Already imported (skip): {len(known & {x.name for x in all_files})}")
    print(f"  Permanently failing:     {len(failed_skip)}")
    print(f"  Younger than {args.min_age}s: {too_young}")
    print(f"  To process:              {total}{'  (limit)' if args.limit else ''}\n")

    imported = 0
    err_count = 0
    if total == 0:
        print("Nothing new to import.")
    else:
        batch: list[tuple[str, str, UpdateOne]] = []  # (filename, _id, op)

        def flush():
            nonlocal imported, err_count
            if not batch:
                return
            if args.dry_run:
                batch.clear()
                return
            ops = [op for _, _, op in batch]
            try:
                col.bulk_write(ops, ordered=False)
                for fn, mid, _ in batch:
                    state["done"][fn] = mid
                    state["failed"].pop(fn, None)
            except Exception as e:
                # Per-document fallback: an error discards only the faulty record
                logging.error("bulk_write failed (%s) -- switching to per-document writes", e)
                print(f"  ERROR bulk_write: {e} -- trying per-document writes")
                for fn, mid, op in batch:
                    try:
                        col.bulk_write([op], ordered=False)
                        state["done"][fn] = mid
                        state["failed"].pop(fn, None)
                    except Exception as e2:
                        logging.error("per-document write failed [%s, _id=%s]: %s", fn, mid, e2)
                        print(f"  DROPPED {fn} (_id={mid}): {e2}")
                        state["failed"][fn] = state["failed"].get(fn, 0) + 1
                        imported -= 1
                        err_count += 1
            save_state(state)
            batch.clear()

        for i, msg_path in enumerate(candidates, 1):
            doc = extract_message(msg_path)
            if doc is None:
                err_count += 1
                if not args.dry_run:
                    state["failed"][msg_path.name] = state["failed"].get(msg_path.name, 0) + 1
            else:
                batch.append((msg_path.name, doc["_id"],
                              UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True)))
                imported += 1
                if len(batch) >= BATCH_SIZE:
                    flush()
            status = "ERR " if doc is None else "OK  "
            subject_str = (doc.get("subject") or "")[:60] if doc else "?"
            sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
            print(f"  {i:>5}/{total}  {status} {subject_str:<60}  {sender_str}")
            if i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = i / elapsed if elapsed > 0 else 0
                eta_s = int((total - i) / rate) if rate > 0 else 0
                print(f"  {'-'*80}")
                print(f"  Progress: ok={imported} err={err_count} "
                      f"{rate:.1f} msg/s  ETA {eta_s//3600}h{(eta_s%3600)//60}m")
                print(f"  {'-'*80}")
        flush()
        if not args.dry_run:
            save_state(state)

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"STEP 1 (import): ok={imported} err={err_count}"
          f"{'  [DRY-RUN: nothing written]' if args.dry_run else ''}")
    print(f"Import time: {int(elapsed_total//60)}m {int(elapsed_total%60)}s")
    if not args.dry_run:
        print(f"Documents in collection: {col.estimated_document_count()}")
    client.close()

    # STEP 2: enrich into PG (only when something was imported and this is not a dry run)
    enrich_rc = 0
    if args.dry_run:
        print("\nSTEP 2 (enrich): skipped [DRY-RUN]")
    elif args.no_enrich:
        print("\nSTEP 2 (enrich): skipped [--no-enrich]")
    elif imported == 0:
        print("\nSTEP 2 (enrich): skipped (no new e-mails)")
    else:
        enrich_rc = run_enrich()

    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Parsing errors logged to: {ERR_LOG}")
    return 1 if (err_count > 0 or enrich_rc != 0) else 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("\nInterrupted by the user")
        sys.exit(130)
@@ -0,0 +1,40 @@
#!/bin/bash
# ============================================================================
# Wrapper for jnj_emails_to_fulltext_v1.0.py (JNJ .msg -> Mongo -> PG fulltext).
# Runs on the Unraid host and starts the script inside the python-runner container.
# Dated logs + run_latest.log symlink, log cleanup after 30 days.
#
# Install via the User Scripts plugin or /etc/cron.d (NOT INSTALLED yet):
#   30 6,18 * * * /mnt/user/Scripts/Janssen_emails_to_fulltext/run_jnj_emails_to_fulltext.sh
# ============================================================================
set -u
BASE_DIR="/mnt/user/Scripts/Janssen_emails_to_fulltext"
LOG_DIR="${BASE_DIR}/logs"
TIMESTAMP=$(date +%Y%m%d_%H%M)
LOG_FILE="${LOG_DIR}/run_${TIMESTAMP}.log"
LATEST_LINK="${LOG_DIR}/run_latest.log"
RETENTION_DAYS=30
mkdir -p "$LOG_DIR"
echo "=== jnj_emails_to_fulltext run @ $(date '+%Y-%m-%d %H:%M:%S') ===" >> "$LOG_FILE"
if ! docker inspect -f '{{.State.Running}}' python-runner 2>/dev/null | grep -q true; then
    echo "ERROR: python-runner container is not running" >> "$LOG_FILE"
    docker start python-runner >> "$LOG_FILE" 2>&1 || exit 1
    sleep 5
fi
docker exec python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py "$@" >> "$LOG_FILE" 2>&1
RET=$?
echo "" >> "$LOG_FILE"
echo "=== Wrapper finished @ $(date '+%Y-%m-%d %H:%M:%S') exit=$RET ===" >> "$LOG_FILE"
ln -sf "$LOG_FILE" "$LATEST_LINK"
find "$LOG_DIR" -name 'run_*.log' -type f -mtime +${RETENTION_DAYS} -delete
exit $RET