z230
@@ -0,0 +1,86 @@
# jnj_emails_to_fulltext_v1.0

**Version:** 1.0
**Date:** 2026-06-10
**Author:** vladimir.buzalka
**Location:** `/mnt/user/Scripts/Janssen_emails_to_fulltext/` (inside the container: `/scripts/Janssen_emails_to_fulltext/`)

## Purpose

Incremental processing of JNJ e-mails. The only access to the mailbox `vbuzalka@its.jnj.com`
is a continuous export of `.msg` files into `/mnt/JNJEMAILS` (~70,000 files,
with new ones arriving daily). For each new file the script does the following:

1. **STEP 1 (IMPORT):** parses the file and stores it in MongoDB `emaily."vbuzalka@its.jnj.com"`
   (same schema and `_id` logic as the bulk import `parse_emails_tower_v1.3.py`)
2. **STEP 2 (ENRICH):** writes the fulltext to PostgreSQL `MongoEmaily.emails` by delegating to the
   existing `/scripts/5_enrich_fulltext_emails_v1.4.py --mailbox "vbuzalka@its.jnj.com"`,
   so the PG schema, `extractor_version` and skip logic stay identical to the main
   Graph pipeline (step 5 in `0_run_pipeline`). The main pipeline then simply
   skips these records (matching ext_v, ok=true), so no work is done twice.

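The `_id` rule mentioned in STEP 1 (Internet Message-ID, with a fallback to `filename:<stem>`) can be sketched as below. This is a minimal illustration, not the script's actual function: the helper name `document_id` is hypothetical, and it assumes the Message-ID has already been read from the message as a string or `None`.

```python
from pathlib import Path
from typing import Optional


def document_id(message_id: Optional[str], msg_path: Path) -> str:
    """Return the Mongo _id for one .msg file (hypothetical helper).

    Uses the Internet Message-ID when it is present and non-blank,
    otherwise falls back to "filename:<stem>".
    """
    mid = (message_id or "").strip()
    return mid if mid else f"filename:{msg_path.stem}"


if __name__ == "__main__":
    # Two exports of the same e-mail share a Message-ID, so they map to one document.
    print(document_id("<abc@example.com>", Path("/mnt/JNJEMAILS/first.msg")))
    print(document_id("<abc@example.com>", Path("/mnt/JNJEMAILS/second.msg")))
    # A message without a Message-ID is keyed by its file name.
    print(document_id(None, Path("/mnt/JNJEMAILS/draft 01.msg")))
```

Because the key depends only on the message, two different files can collapse into a single document, which is why processed file names are tracked separately in `state.json`.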
## Relationship to existing scripts

| Script | Role |
|--------|------|
| `parse_emails_tower_v1.3.py` | one-off bulk import (70k files, ~48 h); source of the parsing logic |
| `1b_parse_emails_graph_delta_v1.0.py` | Graph delta for the buzalka.cz mailboxes; JNJ is intentionally left out |
| `5_enrich_fulltext_emails_v1.4.py` | enrichment into PG; **this script calls it with `--mailbox`** |
| **`jnj_emails_to_fulltext_v1.0.py`** | **incremental .msg → Mongo → PG for JNJ** |

Once the script has proven itself in operation, the plan is to merge it into the main pipeline `0_run_pipeline` as an additional step
(e.g. "1c: JNJ msg import").

## Incrementality (what gets skipped)

- files recorded in `state.json` under the `done` key, `{filename: message_id}`
- safety net: `distinct("filename")` from the Mongo collection (the state seeds itself on the first run);
  `state.json` also handles duplicate Message-IDs (2 files → 1 document), which would
  otherwise be re-imported forever if only the Mongo distinct were used
- files younger than `--min-age` seconds (default 300), because the export may still be writing them
- files that have failed 3 times (`MAX_FAIL`); `--retry-failed` tries them again
- flock (`.lock`): a concurrent start exits without doing any work

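The skip rules above can be condensed into one pure function. The sketch below is an illustration under stated assumptions: `select_candidates` is a hypothetical name, files are passed as `(name, mtime)` pairs instead of being scanned from disk, and the state dictionary has the `done` / `failed` shape described in this README.

```python
import time

MAX_FAIL = 3  # same threshold as named in the README


def select_candidates(files, state, mongo_filenames, min_age=300,
                      retry_failed=False, now=None):
    """Return the names of files that still need importing (hypothetical helper).

    files: iterable of (filename, mtime_seconds) pairs.
    state: {"done": {filename: message_id}, "failed": {filename: count}}.
    mongo_filenames: file names already present in the Mongo collection.
    """
    now = time.time() if now is None else now
    # Known = recorded in state.json OR already stored in Mongo.
    known = set(state["done"]) | set(mongo_filenames)
    failed_skip = set()
    if not retry_failed:
        failed_skip = {fn for fn, cnt in state["failed"].items() if cnt >= MAX_FAIL}

    selected = []
    for name, mtime in files:
        if name in known or name in failed_skip:
            continue
        if now - mtime < min_age:  # the export may still be writing this file
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    state = {"done": {"a.msg": "<id1>", "b.msg": "<id1>"},
             "failed": {"c.msg": 3, "d.msg": 1}}
    files = [("a.msg", 0), ("b.msg", 0), ("c.msg", 0),
             ("d.msg", 0), ("e.msg", 990), ("f.msg", 0)]
    print(select_candidates(files, state, {"a.msg"}, now=1000))
```

In the example, `a.msg` and `b.msg` share one Message-ID. Mongo only remembers one file name per document, so a check against Mongo alone would keep treating the other file as new; the `done` map in `state.json` is what marks both as processed.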
## Running

```bash
# Preview without writes (Mongo, PG and state stay untouched):
docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py --dry-run --limit 10

# Live incremental run:
docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py

# Wrapper with a dated log (for User Scripts / cron):
/mnt/user/Scripts/Janssen_emails_to_fulltext/run_jnj_emails_to_fulltext.sh
```

| Parameter | Meaning |
|-----------|---------|
| `--dry-run` | parses, but writes nothing (Mongo, PG, state.json) |
| `--limit N` | at most N new files (for testing) |
| `--min-age S` | skip files younger than S seconds (default 300) |
| `--no-enrich` | import into Mongo only, without PG |
| `--retry-failed` | retry files that keep failing |
| `--msgs-dir DIR` | a different source directory (default `/mnt/JNJEMAILS`) |

## Outputs and logs (all in the script directory)

- `logs/run_YYYYMMDD_HHMM.log`: stdout of the run (via the wrapper; `run_latest.log` is a symlink)
- `logs/errors.log`: parse errors for individual files
- `state.json`: `{done: {filename: message_id}, failed: {filename: count}}`

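For a quick health check of `state.json`, a small summary helper along the following lines may be useful. It is only a sketch: `summarize_state` is a hypothetical name that is not part of the script, and it assumes the state has exactly the `done` / `failed` shape listed above.

```python
import json
from pathlib import Path

MAX_FAIL = 3  # threshold after which a file is skipped, as described in the README


def summarize_state(state: dict) -> dict:
    """Summarize a loaded state.json dictionary (hypothetical helper)."""
    done = state.get("done", {})
    failed = state.get("failed", {})
    return {
        "done_files": len(done),
        # Fewer distinct ids than files means some files share a Message-ID.
        "distinct_message_ids": len(set(done.values())),
        "permanently_failing": sorted(fn for fn, n in failed.items() if n >= MAX_FAIL),
    }


if __name__ == "__main__":
    # Assumed location: state.json next to this file, as in the script directory.
    path = Path(__file__).resolve().parent / "state.json"
    if path.exists():
        print(summarize_state(json.loads(path.read_text(encoding="utf-8"))))
```

The `permanently_failing` list shows which files a run with `--retry-failed` would pick up again.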
## Exit codes

- `0`: OK (including "nothing new")
- `1`: parse errors / enrich failed / Mongo unavailable

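A caller that chains this script with other steps only needs to distinguish zero from non-zero. The sketch below shows one way to do that from Python; `run_step` is a hypothetical helper, and the command passed to it is whatever the caller chooses to run.

```python
import subprocess
import sys


def run_step(cmd: list) -> bool:
    """Run a command and return True when it exits with code 0 (hypothetical helper)."""
    result = subprocess.run(cmd)
    return result.returncode == 0


if __name__ == "__main__":
    # Stand-in command for illustration; a real caller would pass the script's command line.
    ok = run_step([sys.executable, "-c", "raise SystemExit(0)"])
    print("step succeeded" if ok else "step failed, see logs/errors.log")
```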
## Dependencies

The `python-runner` image already contains `extract-msg==0.55.0`, `olefile`, `pymongo` and
`python-dateutil`. STEP 2 additionally needs `psycopg`, `bs4` and `lxml`; pipeline step 5
uses them daily, so they are available.

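Before a first run in a new image, the presence of these packages can be checked without importing them. The snippet is a sketch: `missing_modules` is a hypothetical helper, and the import names in `REQUIRED` are assumptions derived from the package names above (`extract-msg` imports as `extract_msg`, `python-dateutil` as `dateutil`).

```python
import importlib.util

# Import names assumed from the package list in this README.
REQUIRED = ["extract_msg", "olefile", "pymongo", "dateutil", "psycopg", "bs4", "lxml"]


def missing_modules(modules):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    print("all dependencies found" if not missing else f"missing: {', '.join(missing)}")
```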
## Version history

- **1.0** (2026-06-10): initial version; parsing logic taken 1:1 from
  `parse_emails_tower_v1.3.py`, enrichment delegated to `5_enrich_fulltext_emails_v1.4.py`
@@ -0,0 +1,890 @@
"""
jnj_emails_to_fulltext_v1.0.py
Name: jnj_emails_to_fulltext_v1.0.py
Version: 1.0
Date: 2026-06-10
Author: vladimir.buzalka

Description:
    Incremental pipeline for JNJ e-mails exported as .msg files.
    New .msg files land in /mnt/JNJEMAILS (exported from Outlook; there is no
    other access to the vbuzalka@its.jnj.com mailbox). The script has two steps:

    STEP 1 (IMPORT): new .msg -> MongoDB emaily."vbuzalka@its.jnj.com"
    STEP 2 (ENRICH): fulltext -> PostgreSQL MongoEmaily.emails
                     (calls the existing /scripts/5_enrich_fulltext_emails_v1.4.py
                     with the --mailbox parameter, so the PG schema and extractor_version
                     stay fully consistent with the main Graph pipeline)

    The STEP 1 parsing logic is taken 1:1 from parse_emails_tower_v1.3.py:
      - cascading open (normal -> SUPPRESS_ALL -> overrideEncoding)
      - raw-OLE fallback for degraded text fields (the declared encoding is not trusted)
      - to_bson guard against MAPI values larger than int64
      - same document schema, same collection, same _id scheme
        (Internet Message-ID, fallback "filename:<stem>")

    Incrementality (what gets skipped):
      - files listed in the state file state.json (key "done")
      - plus a safety net: distinct("filename") from the Mongo collection
        (state.json seeds itself from Mongo on the first run)
      - files younger than --min-age seconds (they may still be being written)
      - files that have failed MAX_FAIL times (--retry-failed tries them again)

    The state file state.json also handles the duplicate Message-ID edge case
    (2 different .msg files with the same _id would otherwise be re-imported
    alternately forever if only Mongo distinct("filename") were used).

Environment:
    Runs in the Docker container "python-runner" on Unraid Tower.
    /mnt/user/JNJEMAILS -> /mnt/JNJEMAILS (source .msg files)
    /mnt/user/Scripts -> /scripts (this script + the enrich script)
    MongoDB 192.168.1.76:27017 db=emaily collection=vbuzalka@its.jnj.com
    PostgreSQL 192.168.1.76:5432 db=MongoEmaily table=emails (via enrich)

Running (from the Unraid terminal):
    # Preview without writes (parses, but writes to neither Mongo nor PG):
    docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py --dry-run --limit 10

    # Live incremental run (import + enrich):
    docker exec -it python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py

    # Via the wrapper with a dated log (for cron):
    /mnt/user/Scripts/Janssen_emails_to_fulltext/run_jnj_emails_to_fulltext.sh

Parameters:
    --dry-run       parsing runs, but NOTHING is written (Mongo, PG, state)
    --limit N       process at most N new files (for testing)
    --min-age S     skip files younger than S seconds (default 300)
    --no-enrich     skip STEP 2 (import into Mongo only)
    --retry-failed  also retry files that have already failed MAX_FAIL times
    --msgs-dir DIR  a different source directory (default /mnt/JNJEMAILS)

Output / logs (all in the script directory):
    stdout           progress (the wrapper redirects it to logs/run_*.log)
    logs/errors.log  parse errors for individual files
    state.json       state: done={filename: message_id}, failed={filename: count}

Exit codes:
    0 = OK (including "nothing new")
    1 = error (parse errors / enrich failed / Mongo unavailable)

Version history:
    1.0 2026-06-10 Initial version (fork of the parse_emails_tower_v1.3 parsing logic,
                   enrich delegated to 5_enrich_fulltext_emails_v1.4 --mailbox)
"""

import sys
import os
import re
import json
import time
import logging
import argparse
import base64
import struct
import subprocess
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional

import extract_msg
from extract_msg.enums import ErrorBehavior
import olefile
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne

try:
    import fcntl  # Linux only (always available in the container)
except ImportError:  # local development on Windows
    fcntl = None

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# ─── CONFIGURATION ────────────────────────────────────────────────────────────
SCRIPT_DIR = Path(__file__).resolve().parent
MSGS_DIR = Path("/mnt/JNJEMAILS")
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "vbuzalka@its.jnj.com"
ENRICH_SCRIPT = Path("/scripts/5_enrich_fulltext_emails_v1.4.py")
BATCH_SIZE = 200
MAX_FAIL = 3  # skip a file after this many failures
DEFAULT_MIN_AGE = 300  # seconds; younger files may still be being written
STATE_FILE = SCRIPT_DIR / "state.json"
LOCK_FILE = SCRIPT_DIR / ".lock"
LOG_DIR = SCRIPT_DIR / "logs"
ERR_LOG = LOG_DIR / "errors.log"
SCRIPT_VERSION = "1.0"
# ──────────────────────────────────────────────────────────────────────────────

LOG_DIR.mkdir(exist_ok=True)
logging.basicConfig(
    filename=str(ERR_LOG),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)


# ─── Helper functions (1:1 from parse_emails_tower_v1.3) ─────────────────────

def safe(obj, *attrs, default=None):
    """Safe attribute read: returns the first value that is neither None nor a blank string."""
    for attr in attrs:
        try:
            val = getattr(obj, attr, None)
            if val is None:
                continue
            if isinstance(val, str) and not val.strip():
                continue
            return val
        except Exception:
            continue
    return default


def parse_date(raw) -> Optional[datetime]:
    """Any date value -> UTC datetime without tzinfo (for MongoDB)."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


_INT64_MIN, _INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def to_bson(val):
    """Convert a value to a BSON-serializable type.

    BSON only supports signed int64; large MAPI values (PR_CHANGE_KEY, FILETIME)
    outside that range are converted to strings, otherwise the whole bulk_write fails.
    """
    if isinstance(val, bool):  # bool BEFORE int (isinstance(True, int) is True)
        return val
    if isinstance(val, bytes):
        return val.hex() if len(val) <= 128 else f"<bytes:{len(val)}>"
    if isinstance(val, datetime):
        return parse_date(val)
    if isinstance(val, int):
        return val if _INT64_MIN <= val <= _INT64_MAX else str(val)
    if isinstance(val, (str, float, type(None))):
        return val
    if isinstance(val, list):
        return [to_bson(v) for v in val]
    try:
        iv = int(val)
        return iv if _INT64_MIN <= iv <= _INT64_MAX else str(iv)
    except Exception:
        pass
    return str(val)


def extract_headers(msg) -> dict:
    headers = {}
    try:
        hdr = msg.header
        if not hdr:
            return {}
        from email.header import decode_header as _dh

        def _decode(v: str) -> str:
            try:
                parts = _dh(v)
                out = ""
                for part, enc in parts:
                    out += part.decode(enc or "utf-8", errors="replace") if isinstance(part, bytes) else part
                return out
            except Exception:
                return v

        for key in set(hdr.keys()):
            k = key.lower().replace("-", "_")
            vals = [_decode(v) for v in hdr.get_all(key, [])]
            headers[k] = vals if len(vals) > 1 else (vals[0] if vals else "")
    except Exception as e:
        logging.error("extract_headers: %s", e)
    return headers


def extract_recipients(msg) -> list:
    result = []
    type_map = {1: "to", 2: "cc", 3: "bcc"}
    try:
        for r in msg.recipients:
            rtype = getattr(r, "type", 1)
            try:
                rtype = int(rtype)
            except Exception:
                try:
                    rtype = int(rtype.value)
                except Exception:
                    rtype = 1
            rec = {
                "type": type_map.get(rtype, "to"),
                "email": safe(r, "email", default=""),
                "name": safe(r, "name", default=""),
            }
            result.append(rec)
    except Exception as e:
        logging.error("extract_recipients: %s", e)
    return result


def extract_attachments(msg) -> list:
    result = []
    try:
        for att in msg.attachments:
            fname = safe(att, "longFilename", "shortFilename", default="")
            if not fname:
                continue
            size = 0
            try:
                d = att.data
                size = len(d) if d else 0
            except Exception:
                pass
            result.append({
                "filename": fname,
                "size_bytes": size,
                "mime_type": safe(att, "mimetype", "mimeType", default="application/octet-stream"),
                "content_id": safe(att, "cid", default=None),
                "is_inline": bool(safe(att, "isInline", default=False)),
            })
    except Exception as e:
        logging.error("extract_attachments: %s", e)
    return result


def extract_mapi_props(msg) -> dict:
    """All raw MAPI properties as {0xXXXX: value}."""
    result = {}
    try:
        props = msg.props
        if not hasattr(props, "items"):
            return {}
        for key, prop in props.items():
            try:
                val = to_bson(prop.value)
                prop_id = f"0x{key[:4].upper()}" if len(key) >= 4 else f"0x{key.upper()}"
                result[prop_id] = val
            except Exception:
                pass
    except Exception as e:
        logging.error("extract_mapi_props: %s", e)
    return result


# ─── Tolerant opening and raw-OLE fallback (1:1 from v1.3) ───────────────────

_CPID_TO_CODEC = {
    1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
    1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257",
    1258: "cp1258", 874: "cp874", 932: "shift_jis", 936: "gb2312",
    949: "euc_kr", 950: "big5", 65001: "utf-8", 28591: "iso-8859-1",
    28592: "iso-8859-2", 20127: "ascii",
}


def _read_u32_prop(ole, propid):
    """Read a 32-bit MAPI property value from the top-level __properties_version1.0 stream."""
    try:
        data = ole.openstream("__properties_version1.0").read()
    except Exception:
        return None
    body = data[32:]
    for i in range(0, len(body) - 16 + 1, 16):
        rec = body[i:i + 16]
        tag = struct.unpack("<I", rec[0:4])[0]
        if ((tag >> 16) & 0xFFFF) == propid:
            return struct.unpack("<I", rec[8:12])[0]
    return None


def _detect_cpid(ole) -> Optional[str]:
    """Codec from PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE (a hint, not gospel)."""
    for pid in (0x3FDE, 0x3FFD):
        codec = _CPID_TO_CODEC.get(_read_u32_prop(ole, pid))
        if codec and codec not in ("utf-8", "ascii"):
            return codec
    return None


def _cascade_decode(raw: bytes, is_unicode: bool, cpid_codec: Optional[str]) -> str:
    """Decode the bytes of a MAPI string; headers are not trusted, so a cascade of strict attempts is used."""
    if not raw:
        return ""
    if is_unicode:
        try:
            return raw.decode("utf-16-le")
        except Exception:
            return raw.decode("utf-16-le", errors="replace")
    order = ["utf-8"]
    if cpid_codec:
        order.append(cpid_codec)
    order += ["cp1250", "cp1252", "gb2312", "big5"]
    for enc in order:
        try:
            return raw.decode(enc, errors="strict")
        except Exception:
            continue
    return raw.decode("latin-1", errors="replace")


def _raw_mapi_strings(msg_path: Path) -> dict:
    """Read the key MAPI text fields DIRECTLY from OLE (bypassing extract_msg)."""
    out = {"subject": "", "normalized_subject": "", "sender_name": "",
           "sender_email": "", "sender_smtp": "", "body_text": "", "body_html": ""}
    try:
        ole = olefile.OleFileIO(str(msg_path))
    except Exception:
        return out
    try:
        cpid = _detect_cpid(ole)
        wanted = {
            "0037": "subject", "0E1D": "normalized_subject",
            "0C1A": "sender_name", "5D01": "sender_smtp",
            "0C1F": "sender_email", "1000": "body_text", "1013": "body_html",
        }
        prefix = "__substg1.0_"
        found = {}
        for entry in ole.listdir():
            if len(entry) != 1:
                continue
            name = entry[0]
            if not name.startswith(prefix):
                continue
            tag = name[len(prefix):len(prefix) + 4].upper()
            key = wanted.get(tag)
            if not key:
                continue
            typ = name[-4:].upper()
            prio = {"001F": 3, "001E": 2, "0102": 1}.get(typ, 0)
            if prio == 0:
                continue
            prev = found.get(key)
            if prev and prev[0] >= prio:
                continue
            try:
                raw = ole.openstream(entry).read()
                val = _cascade_decode(raw, typ == "001F", cpid)
            except Exception:
                continue
            found[key] = (prio, val)
        for key, (_, val) in found.items():
            out[key] = val
    finally:
        ole.close()
    return out


def _degraded(s) -> bool:
    """A field is degraded if it is empty or contains U+FFFD."""
    return (not s) or ("�" in s)


def open_message(msg_path: Path):
    """Cascading open of a .msg -> (msg, mode) or (None, None)."""
    try:
        return extract_msg.Message(str(msg_path)), "normal"
    except Exception:
        pass
    try:
        return extract_msg.Message(
            str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL), "suppress_all"
    except Exception:
        pass
    encs = []
    try:
        ole = olefile.OleFileIO(str(msg_path))
        c = _detect_cpid(ole)
        ole.close()
        if c:
            encs.append(c)
    except Exception:
        pass
    for e in encs + ["cp1250", "cp1252"]:
        try:
            return extract_msg.Message(
                str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL,
                overrideEncoding=e), f"override:{e}"
        except Exception:
            continue
    return None, None


# ─── Main extraction (1:1 from v1.3) ─────────────────────────────────────────

def extract_message(msg_path: Path) -> Optional[dict]:
    """Parse one .msg file -> MongoDB document."""
    msg, parse_mode = open_message(msg_path)
    if msg is None:
        logging.error("open failed [%s]: vsechny pokusy o otevreni selhaly", msg_path.name)
        return None

    try:
        mid = None
        for attr in ("messageId", "message_id", "internetMessageId"):
            mid = safe(msg, attr)
            if mid:
                break
        if not mid:
            mid = f"filename:{msg_path.stem}"
        mid = str(mid).strip()

        try:
            subject = msg.subject or ""
        except Exception:
            subject = ""

        normalized_subject = safe(msg, "normalizedSubject", "normalized_subject", default="")

        try:
            body_text = msg.body or ""
        except Exception:
            body_text = ""

        body_html = None
        try:
            bh = msg.htmlBody
            if isinstance(bh, bytes):
                bh = bh.decode("utf-8", errors="replace")
            if bh:
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
        except Exception:
            pass

        try:
            sender_email = msg.sender or ""
        except Exception:
            sender_email = ""

        sender_name = safe(msg, "senderName", "sender_name", default="")
        sender_smtp = safe(msg, "senderSmtpAddress", "sent_representing_smtp_address", default="")

        recipients = extract_recipients(msg)

        try:
            to_raw = msg.to or ""
        except Exception:
            to_raw = ""
        try:
            cc_raw = msg.cc or ""
        except Exception:
            cc_raw = ""
        try:
            bcc_raw = getattr(msg, "bcc", None) or ""
        except Exception:
            bcc_raw = ""

        display_to = safe(msg, "displayTo", "display_to", default="")
        display_cc = safe(msg, "displayCc", "display_cc", default="")

        try:
            received_at = parse_date(msg.date)
        except Exception:
            received_at = None

        sent_at = None
        for attr in ("clientSubmitTime", "client_submit_time", "sentOn"):
            v = safe(msg, attr)
            if v:
                sent_at = parse_date(v)
                break

        importance = 1
        try:
            v = msg.importance
            if v is not None:
                importance = int(v)
        except Exception:
            pass

        sensitivity = 0
        try:
            v = getattr(msg, "sensitivity", None)
            if v is not None:
                sensitivity = int(v)
        except Exception:
            pass

        flag_status = 0
        try:
            v = safe(msg, "flagStatus", "flag_status")
            if v is not None:
                flag_status = int(v)
        except Exception:
            pass

        conversation_topic = safe(msg, "conversationTopic", "conversation_topic", default="")

        conversation_index = ""
        try:
            ci = safe(msg, "conversationIndex", "conversation_index")
            if isinstance(ci, bytes):
                conversation_index = base64.b64encode(ci).decode()
            elif ci:
                conversation_index = str(ci)
        except Exception:
            pass

        in_reply_to = safe(msg, "inReplyTo", "in_reply_to", default="")

        internet_refs = []
        try:
            refs = safe(msg, "internetReferences", "internet_references")
            if isinstance(refs, list):
                internet_refs = refs
            elif isinstance(refs, str) and refs:
                internet_refs = [r.strip() for r in refs.split() if r.strip()]
        except Exception:
            pass

        categories = []
        try:
            cats = safe(msg, "categories")
            if isinstance(cats, list):
                categories = [str(c) for c in cats if c]
            elif isinstance(cats, str) and cats:
                categories = [c.strip() for c in re.split(r"[;,]", cats) if c.strip()]
        except Exception:
            pass

        read_receipt = bool(safe(msg, "readReceiptRequested", "read_receipt_requested", default=False))
        delivery_receipt = bool(safe(msg, "deliveryReceiptRequested", "delivery_receipt_requested", default=False))

        headers = extract_headers(msg)

        if not in_reply_to:
            in_reply_to = headers.get("in_reply_to", "")
        if not internet_refs:
            refs_str = headers.get("references", "")
            if isinstance(refs_str, str) and refs_str:
                internet_refs = [r.strip() for r in refs_str.split() if r.strip()]

        attachments = extract_attachments(msg)
        mapi_raw = extract_mapi_props(msg)

        msg.close()

        # Raw-OLE fallback for degraded text fields
        parse_degraded = parse_mode != "normal"
        forced = parse_mode != "normal"
        if (forced or _degraded(subject) or _degraded(body_text)
                or _degraded(sender_email) or (body_html and "�" in body_html)):
            raw = _raw_mapi_strings(msg_path)
            if raw["subject"] and (forced or _degraded(subject)):
                subject = raw["subject"]
            if raw["normalized_subject"] and (forced or _degraded(normalized_subject)):
                normalized_subject = raw["normalized_subject"]
            if raw["body_text"] and (forced or _degraded(body_text)):
                body_text = raw["body_text"]
            if raw["body_html"] and (forced or not body_html or "�" in body_html):
                bh = raw["body_html"]
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
            if (raw["sender_smtp"] or raw["sender_email"]) and (forced or _degraded(sender_email)):
                sender_email = raw["sender_smtp"] or raw["sender_email"]
            if raw["sender_name"] and (forced or _degraded(sender_name)):
                sender_name = raw["sender_name"]
            if raw["sender_smtp"] and not sender_smtp:
                sender_smtp = raw["sender_smtp"]

        return {
            "_id": mid,
            "filename": msg_path.name,

            "subject": subject,
            "normalized_subject": normalized_subject,
            "importance": importance,
            "sensitivity": sensitivity,
            "flag_status": flag_status,
            "read_receipt_requested": read_receipt,
            "delivery_receipt_requested": delivery_receipt,
            "has_attachments": len(attachments) > 0,
            "attachment_count": len(attachments),
            "message_size_bytes": msg_path.stat().st_size,

            "conversation_topic": conversation_topic,
            "conversation_index": conversation_index,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "categories": categories,

            "received_at": received_at,
            "sent_at": sent_at,

            "sender": {
                "email": sender_email,
                "name": sender_name,
                "smtp": sender_smtp,
            },
            "to": to_raw,
            "cc": cc_raw,
            "bcc": bcc_raw,
            "display_to": display_to,
            "display_cc": display_cc,
            "recipients": recipients,

            "body_text": body_text,
            "body_html": body_html,

            "attachments": attachments,
            "headers": headers,
            "mapi": mapi_raw,

            "parse_mode": parse_mode,
            "parse_degraded": parse_degraded,

            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }

    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg_path.name, e)
        return None


# ─── State (state.json) ──────────────────────────────────────────────────────

def load_state() -> dict:
    if STATE_FILE.exists():
        try:
            st = json.loads(STATE_FILE.read_text(encoding="utf-8"))
            if isinstance(st, dict) and "done" in st and "failed" in st:
                return st
        except Exception as e:
            print(f" VAROVANI: state.json nesel nacist ({e}) -- zacinam s prazdnym")
    return {"done": {}, "failed": {}}


def save_state(state: dict) -> None:
    # Write to a temporary file first, then swap it in with os.replace.
    tmp = STATE_FILE.with_suffix(".json.tmp")
    tmp.write_text(json.dumps(state, ensure_ascii=False), encoding="utf-8")
    os.replace(tmp, STATE_FILE)


# ─── Lock against concurrent runs ────────────────────────────────────────────

def acquire_lock():
    """Return the open lock-file handle, or exit the script if another run is in progress."""
    if fcntl is None:
        return None
    lf = open(LOCK_FILE, "w")
    try:
        fcntl.flock(lf, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        print("Jiny beh jnj_emails_to_fulltext jeste probiha (lock) -- koncim.")
        sys.exit(0)
    return lf


# ─── STEP 2: enrich (delegated to the existing pipeline script) ──────────────

def run_enrich() -> int:
    """Run 5_enrich_fulltext_emails --mailbox <JNJ collection>. Returns the exit code."""
    if not ENRICH_SCRIPT.exists():
        print(f" CHYBA: enrich skript nenalezen: {ENRICH_SCRIPT}")
        return 1
    cmd = [sys.executable, str(ENRICH_SCRIPT), "--mailbox", MONGO_COL]
    print(f"\n=== KROK 2: ENRICH (PG fulltext) ===")
    print(f" {' '.join(cmd)}")
    sys.stdout.flush()
    r = subprocess.run(cmd)
    print(f" enrich exit code: {r.returncode}")
    return r.returncode


# ─── MAIN ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
def main() -> int:
|
||||
ap = argparse.ArgumentParser(description=f"jnj_emails_to_fulltext v{SCRIPT_VERSION}")
|
||||
ap.add_argument("--msgs-dir", default=str(MSGS_DIR),
|
||||
help="Cesta k .msg souborum")
|
||||
ap.add_argument("--limit", type=int, default=0,
|
||||
help="Zpracovat max N novych souboru (0 = vse)")
|
||||
ap.add_argument("--min-age", type=int, default=DEFAULT_MIN_AGE,
|
||||
help=f"Preskoc soubory mladsi nez S sekund (default {DEFAULT_MIN_AGE})")
|
||||
ap.add_argument("--dry-run", action="store_true",
|
||||
help="Parsuje, ale NEZAPISUJE (Mongo, PG, state.json)")
|
||||
ap.add_argument("--no-enrich", action="store_true",
|
||||
help="Preskocit KROK 2 (PG fulltext)")
|
||||
ap.add_argument("--retry-failed", action="store_true",
|
||||
help=f"Zkusit znovu i soubory s {MAX_FAIL}+ selhanimi")
|
||||
args = ap.parse_args()
|
||||
|
||||
msgs_dir = Path(args.msgs_dir)
|
||||
start = datetime.now()
|
||||
|
||||
print(f"=== jnj_emails_to_fulltext v{SCRIPT_VERSION} ===")
|
||||
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}{' [DRY-RUN]' if args.dry_run else ''}")
|
||||
print(f"Zdroj: {msgs_dir}")
|
||||
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
|
||||
|
||||
lock = acquire_lock() # noqa: F841 — drzime handle do konce behu
|
||||
|
||||
# MongoDB
|
||||
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
|
||||
try:
|
||||
client.admin.command("ping")
|
||||
print(" MongoDB OK")
|
||||
except Exception as e:
|
||||
print(f" CHYBA: MongoDB neni dostupna -- {e}")
|
||||
return 1
|
||||
|
||||
col = client[MONGO_DB][MONGO_COL]
|
||||
|
||||
# Stav + seed z Mongo (filename je v kazdem dokumentu z tower importu)
|
||||
state = load_state()
|
||||
print(" Nacitam seznam jiz importovanych souboru z MongoDB...")
|
||||
mongo_filenames = set(col.distinct("filename"))
|
||||
print(f" v Mongo: {len(mongo_filenames)} souboru, ve state.json: {len(state['done'])}")
|
||||
known = set(state["done"]) | mongo_filenames
|
||||
|
||||
failed_skip = set()
|
||||
if not args.retry_failed:
|
||||
failed_skip = {fn for fn, cnt in state["failed"].items() if cnt >= MAX_FAIL}
|
||||
|
||||
# Scan
|
||||
print(f"\nSkenuji {msgs_dir} ...")
|
||||
all_files = sorted(msgs_dir.glob("*.msg"))
|
||||
now_ts = time.time()
|
||||
|
||||
too_young = 0
|
||||
candidates = []
|
||||
for f in all_files:
|
||||
if f.name in known or f.name in failed_skip:
|
||||
continue
|
||||
try:
|
||||
if now_ts - f.stat().st_mtime < args.min_age:
|
||||
too_young += 1
|
||||
continue
|
||||
except OSError:
|
||||
continue
|
||||
candidates.append(f)
|
||||
|
||||
if args.limit:
|
||||
candidates = candidates[:args.limit]
|
||||
|
||||
total = len(candidates)
|
||||
print(f" Celkem .msg na disku: {len(all_files)}")
|
||||
print(f" Jiz importovano (skip): {len(known & {x.name for x in all_files})}")
|
||||
print(f" Trvale selhavajici: {len(failed_skip)}")
|
||||
print(f" Mladsi nez {args.min_age}s: {too_young}")
|
||||
print(f" Ke zpracovani: {total}{' (limit)' if args.limit else ''}\n")
|
||||
|
||||
    imported = 0
    err_count = 0

    if total == 0:
        print("Nothing new to import.")
    else:
        batch: list[tuple[str, str, UpdateOne]] = []  # (filename, _id, op)

        def flush():
            """Write the pending batch to Mongo and record the outcome in state."""
            nonlocal imported, err_count
            if not batch:
                return
            if args.dry_run:
                batch.clear()
                return
            ops = [op for _, _, op in batch]
            try:
                col.bulk_write(ops, ordered=False)
                for fn, mid, _ in batch:
                    state["done"][fn] = mid
                    state["failed"].pop(fn, None)
            except Exception as e:
                # Per-document fallback: an error drops only the faulty record
                logging.error("bulk_write failed (%s) -- falling back to per-document writes", e)
                print(f" ERROR bulk_write: {e} -- trying per-document")
                for fn, mid, op in batch:
                    try:
                        col.bulk_write([op], ordered=False)
                        state["done"][fn] = mid
                        state["failed"].pop(fn, None)
                    except Exception as e2:
                        logging.error("per-document write failed [%s, _id=%s]: %s", fn, mid, e2)
                        print(f" DROPPED {fn} (_id={mid}): {e2}")
                        state["failed"][fn] = state["failed"].get(fn, 0) + 1
                        # The document was counted as imported when it was queued
                        imported -= 1
                        err_count += 1
            save_state(state)
            batch.clear()

        for i, msg_path in enumerate(candidates, 1):
            doc = extract_message(msg_path)

            if doc is None:
                err_count += 1
                if not args.dry_run:
                    state["failed"][msg_path.name] = state["failed"].get(msg_path.name, 0) + 1
            else:
                batch.append((msg_path.name, doc["_id"],
                              UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True)))
                imported += 1

            if len(batch) >= BATCH_SIZE:
                flush()

            status = "ERR " if doc is None else "OK "
            subject_str = (doc.get("subject") or "")[:60] if doc else "?"
            sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
            print(f" {i:>5}/{total} {status} {subject_str:<60} {sender_str}")

            # Progress summary every 500 messages
            if i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = i / elapsed if elapsed > 0 else 0
                eta_s = int((total - i) / rate) if rate > 0 else 0
                print(f" {'-'*80}")
                print(f" Progress: ok={imported} err={err_count} "
                      f"{rate:.1f} msg/s ETA {eta_s//3600}h{(eta_s%3600)//60}m")
                print(f" {'-'*80}")

        # Write the remaining partial batch and persist parse failures
        flush()
        if not args.dry_run:
            save_state(state)

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"STEP 1 (import): ok={imported} err={err_count}"
          f"{' [DRY-RUN: nothing written]' if args.dry_run else ''}")
    print(f"Import time: {int(elapsed_total//60)}m {int(elapsed_total%60)}s")
    if not args.dry_run:
        print(f"Documents in collection: {col.estimated_document_count()}")

    client.close()

    # STEP 2: enrich into PG (only if there is something to enrich and this is not a dry run)
    enrich_rc = 0
    if args.dry_run:
        print("\nSTEP 2 (enrich): skipped [DRY-RUN]")
    elif args.no_enrich:
        print("\nSTEP 2 (enrich): skipped [--no-enrich]")
    elif imported == 0:
        print("\nSTEP 2 (enrich): skipped (no new e-mails)")
    else:
        enrich_rc = run_enrich()

    print(f"\nFinished: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Parse errors logged to: {ERR_LOG}")

    return 1 if (err_count > 0 or enrich_rc != 0) else 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("\nInterrupted by user")
        sys.exit(130)
@@ -0,0 +1,40 @@
#!/bin/bash
# ============================================================================
# Wrapper for jnj_emails_to_fulltext_v1.0.py (JNJ .msg -> Mongo -> PG fulltext).
# Runs on the Unraid host; the script itself runs inside the python-runner container.
# Dated logs + run_latest.log symlink; logs are cleaned up after 30 days.
#
# Install via the User Scripts plugin or /etc/cron.d (NOT INSTALLED yet):
#   30 6,18 * * * /mnt/user/Scripts/Janssen_emails_to_fulltext/run_jnj_emails_to_fulltext.sh
# ============================================================================

set -u

BASE_DIR="/mnt/user/Scripts/Janssen_emails_to_fulltext"
LOG_DIR="${BASE_DIR}/logs"
TIMESTAMP=$(date +%Y%m%d_%H%M)
LOG_FILE="${LOG_DIR}/run_${TIMESTAMP}.log"
LATEST_LINK="${LOG_DIR}/run_latest.log"
RETENTION_DAYS=30

mkdir -p "$LOG_DIR"

echo "=== jnj_emails_to_fulltext run @ $(date '+%Y-%m-%d %H:%M:%S') ===" >> "$LOG_FILE"

# Start the container if it is not running; give up if it cannot be started
if ! docker inspect -f '{{.State.Running}}' python-runner 2>/dev/null | grep -q true; then
    echo "ERROR: python-runner container is not running" >> "$LOG_FILE"
    docker start python-runner >> "$LOG_FILE" 2>&1 || exit 1
    sleep 5
fi

docker exec python-runner python /scripts/Janssen_emails_to_fulltext/jnj_emails_to_fulltext_v1.0.py "$@" >> "$LOG_FILE" 2>&1
RET=$?

echo "" >> "$LOG_FILE"
echo "=== Wrapper finished @ $(date '+%Y-%m-%d %H:%M:%S') exit=$RET ===" >> "$LOG_FILE"

ln -sf "$LOG_FILE" "$LATEST_LINK"

# Delete dated logs older than RETENTION_DAYS
find "$LOG_DIR" -name 'run_*.log' -type f -mtime "+${RETENTION_DAYS}" -delete

exit $RET