Files
janssen/Python-runner/Trash/parse_emails_graph_v1.2.py
2026-06-02 17:20:20 +02:00

611 lines
24 KiB
Python

"""
parse_emails_graph_v1.2.py
Nazev: parse_emails_graph_v1.2.py
Verze: 1.2
Datum: 2026-06-02
Autor: vladimir.buzalka
Popis:
Cte vsechny emaily ze schranky ordinace@buzalkova.cz primo pres
Microsoft Graph API a importuje je jako dokumenty do MongoDB.
Ze kazde zpravy extrahuje vsechny dostupne vlastnosti:
- predmet, odesilatel, prijemci (To/CC/BCC s typy)
- cas doruceni, odeslani, vytvoreni, modifikace (UTC)
- telo HTML (max 2 MB) + textovy preview
- prilohy (metadata: jmeno, velikost, MIME typ, inline flag, graph_att_id)
- internet headers (SPF, DKIM, Received, X-*, ...)
- MAPI-ekvivalenty: dulezitost, priznak, konverzacni vlakno,
kategorie, In-Reply-To, References, ...
- navic: isRead, isDraft, folder_path, inferenceClassification
Prochazi VSECHNY slozky schranky rekurzivne (Inbox, Sent, Deleted,
archivni slozky, ...).
DB: emaily
Kolekce: ordinace@buzalkova.cz
_id: Internet Message-ID (nebo "graphid:<id>" jako fallback)
POZOR: Skript pouze CIST ze schranky — zadny zapis do schranky!
Spousteni:
# Prvni import (vsechno):
python parse_emails_graph_v1.2.py
# Test na prvnich 50:
python parse_emails_graph_v1.2.py --limit 50 --no-indexes
# Jen jedna slozka:
python parse_emails_graph_v1.2.py --folder Inbox
# Pokracovani po preruseni (pouze nove):
python parse_emails_graph_v1.2.py --mode new-only
# Pravidelny sync (aktualizuje is_read, flag, slozku; importuje nove):
python parse_emails_graph_v1.2.py --mode sync
# Plny reimport vsech dat:
python parse_emails_graph_v1.2.py --mode full
Rezimy (--mode):
full Plny upsert vsech poli pro kazdou zpravu (vychozi)
new-only Preskoci zpravy ktere uz jsou v MongoDB, importuje jen nove
sync Existujici: aktualizuje jen is_read/flag_status/categories/
modified_at/folder_path. Nove zpravy importuje cely.
Idealni pro pravidelne spousteni.
Zavislosti:
msal, requests, pymongo, python-dateutil
Python 3.10+
Struktura dokumentu v MongoDB:
_id Internet Message-ID (nebo graphid: fallback)
graph_id Graph API message ID
subject predmet zpravy
normalized_subject predmet bez RE:/FW:/AW: prefixu
importance 0=nizka 1=normalni 2=vysoka
flag_status 0=bez priznaku 1=oznaceno 2=dokonceno
is_read bool — aktualni stav precteni ve schrance
is_draft bool
has_attachments bool
attachment_count int
inference_classification focused / other
categories [str]
conversation_id Graph conversationId
conversation_index base64 conversationIndex
conversation_topic tema vlakna (z internet headers Thread-Topic)
in_reply_to Message-ID predchozi zpravy
internet_references [Message-ID]
received_at datetime UTC
sent_at datetime UTC
created_at datetime UTC
modified_at datetime UTC
folder_id Graph parentFolderId
folder_path cela cesta slozky (napr. Inbox/Subfolder)
sender.email emailova adresa odesilatele
sender.name zobrazovane jmeno
to retezec To (joined)
cc retezec CC
bcc retezec BCC
recipients [{type, email, name}]
body_html HTML telo (max 2 MB)
body_preview textovy nahled (max 255 znaku)
attachments [{filename, size_bytes, mime_type, is_inline, graph_att_id}]
headers dict internet headers
parsed_at datetime UTC
Indexy:
received_at, sent_at, sender.email, graph_id (unique),
conversation_id, folder_path, has_attachments, categories,
importance, flag_status, is_read,
text_search (subject + body_preview + to + cc)
Historie verzi:
1.0 2026-06-02 Inicialni verze
1.1 2026-06-02 Pridany rezimy --mode full/new-only/sync;
odstranen --skip-existing (nahrazen --mode new-only)
1.2 2026-06-02 $expand attachments s $select (bez contentBytes — rychlejsi);
prilohy ukladaji graph_att_id pro prime stazeni bez name-matchingu
"""
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── KONFIGURACE ──────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "ordinace@buzalkova.cz"
BATCH_SIZE = 100
PAGE_SIZE = 50
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.2"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
IMPORTANCE_MAP = {"low": 0, "normal": 1, "high": 2}
FLAG_STATUS_MAP = {"notFlagged": 0, "flagged": 1, "complete": 2}
RE_SUBJECT = re.compile(r"^(RE|FW|AW|SV|VS|TR|WG|odpov[eě]d[ťt]|fwd?)[:\s]+", re.IGNORECASE)
# $expand prilohy bez contentBytes — jen metadata co potrebujeme
ATT_EXPAND = "attachments($select=id,name,contentType,size,isInline)"
MSG_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification,internetMessageHeaders"
)
# Pro sync mode staci jen menitelna pole — rychlejsi fetch
MSG_SELECT_SYNC = (
"id,internetMessageId,isRead,isDraft,flag,categories,"
"lastModifiedDateTime,parentFolderId,importance"
)
# ─── Graph API helpers ────────────────────────────────────────────────────────
_graph_token: Optional[str] = None
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get(url: str, params: dict = None) -> dict:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.json()
raise RuntimeError(f"Graph GET failed after retry: {url}")
def get_all_folders(parent_id: str = None, parent_path: str = "") -> list[dict]:
"""Rekurzivne nacte vsechny slozky schranky. Vraci [{id, path}]."""
if parent_id is None:
url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders"
else:
url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{parent_id}/childFolders"
folders = []
params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
while url:
data = graph_get(url, params)
for f in data.get("value", []):
path = f"{parent_path}/{f['displayName']}".lstrip("/")
folders.append({"id": f["id"], "path": path})
if f.get("childFolderCount", 0) > 0:
folders.extend(get_all_folders(f["id"], path))
url = data.get("@odata.nextLink")
params = None
return folders
def iter_folder_messages(folder_id: str, select: str = MSG_SELECT, expand_attachments: bool = True):
"""Generator: vraci zpravy ze slozky po strankach."""
url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{folder_id}/messages"
params = {"$top": PAGE_SIZE, "$select": select}
if expand_attachments:
params["$expand"] = ATT_EXPAND
while url:
data = graph_get(url, params)
for msg in data.get("value", []):
yield msg
url = data.get("@odata.nextLink")
params = None
# ─── Pomocné funkce ───────────────────────────────────────────────────────────
def parse_date(raw) -> Optional[datetime]:
if raw is None:
return None
if isinstance(raw, datetime):
if raw.tzinfo:
return raw.astimezone(timezone.utc).replace(tzinfo=None)
return raw
try:
dt = dtparser.parse(str(raw))
if dt.tzinfo:
return dt.astimezone(timezone.utc).replace(tzinfo=None)
return dt
except Exception:
return None
def normalize_subject(subject: str) -> str:
s = subject.strip()
while True:
m = RE_SUBJECT.match(s)
if not m:
break
s = s[m.end():].strip()
return s
def parse_headers(raw_headers: list) -> dict:
result = {}
for h in raw_headers:
k = h["name"].lower().replace("-", "_")
v = h["value"]
if k in result:
existing = result[k]
result[k] = existing + [v] if isinstance(existing, list) else [existing, v]
else:
result[k] = v
return result
def format_recipients(lst: list) -> str:
return "; ".join(
f'{r["emailAddress"].get("name", "")} <{r["emailAddress"].get("address", "")}>'.strip()
for r in lst
)
# ─── Extrakce zprávy ─────────────────────────────────────────────────────────
def extract_message(msg: dict, folder_path: str) -> Optional[dict]:
"""Plna extrakce — pouziva se pro mode full a nove zpravy v sync/new-only."""
try:
mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
subject = msg.get("subject") or ""
body_html = None
body_preview = msg.get("bodyPreview") or ""
body = msg.get("body", {})
if body.get("contentType") == "html":
content = body.get("content") or ""
body_html = content if len(content) <= 2 * 1024 * 1024 else content[:2 * 1024 * 1024]
elif body.get("contentType") == "text":
body_preview = (body.get("content") or "")[:2000]
sender_ea = (msg.get("from") or msg.get("sender") or {}).get("emailAddress", {})
to_list = msg.get("toRecipients", [])
cc_list = msg.get("ccRecipients", [])
bcc_list = msg.get("bccRecipients", [])
recipients = (
[{"type": "to", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in to_list] +
[{"type": "cc", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in cc_list] +
[{"type": "bcc", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in bcc_list]
)
importance = IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1)
flag_status = FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0)
raw_headers = msg.get("internetMessageHeaders") or []
headers = parse_headers(raw_headers)
in_reply_to = headers.get("in_reply_to", "")
if isinstance(in_reply_to, list):
in_reply_to = in_reply_to[0]
refs_raw = headers.get("references", "")
if isinstance(refs_raw, list):
refs_raw = " ".join(refs_raw)
internet_refs = [r.strip() for r in refs_raw.split() if r.strip()] if refs_raw else []
conv_topic = headers.get("thread_topic", "")
if isinstance(conv_topic, list):
conv_topic = conv_topic[0]
conv_index = ""
ci_raw = msg.get("conversationIndex")
if ci_raw:
try:
conv_index = base64.b64encode(base64.b64decode(ci_raw)).decode()
except Exception:
conv_index = ci_raw
attachments = []
for att in msg.get("attachments") or []:
fname = att.get("name") or ""
if not fname:
continue
attachments.append({
"filename": fname,
"size_bytes": att.get("size", 0),
"mime_type": att.get("contentType", "application/octet-stream"),
"is_inline": att.get("isInline", False),
"graph_att_id": att.get("id"),
})
return {
"_id": mid,
"graph_id": msg["id"],
"subject": subject,
"normalized_subject": normalize_subject(subject),
"importance": importance,
"flag_status": flag_status,
"is_read": msg.get("isRead", False),
"is_draft": msg.get("isDraft", False),
"has_attachments": msg.get("hasAttachments", False),
"attachment_count": len(attachments),
"inference_classification": msg.get("inferenceClassification", ""),
"categories": msg.get("categories") or [],
"conversation_id": msg.get("conversationId", ""),
"conversation_index": conv_index,
"conversation_topic": conv_topic,
"in_reply_to": in_reply_to,
"internet_references": internet_refs,
"received_at": parse_date(msg.get("receivedDateTime")),
"sent_at": parse_date(msg.get("sentDateTime")),
"created_at": parse_date(msg.get("createdDateTime")),
"modified_at": parse_date(msg.get("lastModifiedDateTime")),
"folder_id": msg.get("parentFolderId", ""),
"folder_path": folder_path,
"sender": {
"email": sender_ea.get("address", ""),
"name": sender_ea.get("name", ""),
},
"to": format_recipients(to_list),
"cc": format_recipients(cc_list),
"bcc": format_recipients(bcc_list),
"recipients": recipients,
"body_html": body_html,
"body_preview": body_preview,
"attachments": attachments,
"headers": headers,
"parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
}
except Exception as e:
logging.error("extract_message failed [%s]: %s", msg.get("id", "?"), e)
return None
def extract_sync_fields(msg: dict, folder_path: str) -> dict:
"""Jen menitelna pole — pouziva se v sync mode pro existujici zpravy."""
return {
"is_read": msg.get("isRead", False),
"is_draft": msg.get("isDraft", False),
"flag_status": FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0),
"importance": IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1),
"categories": msg.get("categories") or [],
"modified_at": parse_date(msg.get("lastModifiedDateTime")),
"folder_id": msg.get("parentFolderId", ""),
"folder_path": folder_path,
"parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
}
# ─── MongoDB indexy ───────────────────────────────────────────────────────────
def create_indexes(col):
print(" Vytvarim indexy...")
col.create_index([("received_at", ASCENDING)])
col.create_index([("sent_at", ASCENDING)])
col.create_index([("sender.email", ASCENDING)])
col.create_index([("graph_id", ASCENDING)], unique=True, sparse=True)
col.create_index([("conversation_id", ASCENDING)])
col.create_index([("folder_path", ASCENDING)])
col.create_index([("has_attachments", ASCENDING)])
col.create_index([("categories", ASCENDING)])
col.create_index([("importance", ASCENDING)])
col.create_index([("flag_status", ASCENDING)])
col.create_index([("is_read", ASCENDING)])
col.create_index([
("subject", TEXT),
("body_preview", TEXT),
("to", TEXT),
("cc", TEXT),
], name="text_search", default_language="none")
print(" Indexy hotovy.")
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"parse_emails_graph v{SCRIPT_VERSION}")
ap.add_argument("--mode", default="full", choices=["full", "new-only", "sync"],
help="full=plny upsert (vychozi) | new-only=jen nove zpravy | "
"sync=existujici aktualizuje jen menitelna pole, nove importuje cely")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N zprav (0 = vse)")
ap.add_argument("--folder", default="",
help="Zpracovat jen slozku se zadanym nazvem (napr. Inbox)")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na konci")
args = ap.parse_args()
start = datetime.now()
print(f"=== parse_emails_graph v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {GRAPH_MAILBOX}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
print(f"Režim: {args.mode}")
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col = client[MONGO_DB][MONGO_COL]
# Existující _id (potřeba pro new-only a sync)
existing: set = set()
if args.mode in ("new-only", "sync"):
print(" Nacitam existujici zaznamy z MongoDB...")
existing = set(col.distinct("_id"))
print(f" {len(existing)} jiz importovano")
print("\nNacitam seznam slozek...")
all_folders = get_all_folders()
if args.folder:
all_folders = [f for f in all_folders if args.folder.lower() in f["path"].lower()]
print(f" Slozek ke zpracovani: {len(all_folders)}")
for f in all_folders:
print(f" {f['path']}")
# V sync mode fetchujeme jen menitelna pole
is_sync = args.mode == "sync"
msg_select = MSG_SELECT_SYNC if is_sync else MSG_SELECT
expand_att = not is_sync
batch = []
ok_count = 0
sync_count = 0
err_count = 0
skip_count = 0
total_i = 0
def flush():
if not batch:
return
try:
col.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
print()
for folder in all_folders:
print(f"--- Složka: {folder['path']} ---")
folder_count = 0
for msg in iter_folder_messages(folder["id"], select=msg_select, expand_attachments=expand_att):
if args.limit and total_i >= args.limit:
break
mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
total_i += 1
folder_count += 1
if args.mode == "new-only" and mid in existing:
skip_count += 1
continue
if is_sync and mid in existing:
# Sync existujici — jen menitelna pole
fields = extract_sync_fields(msg, folder["path"])
batch.append(UpdateOne({"_id": mid}, {"$set": fields}))
sync_count += 1
status = "SYN "
print(f" {total_i:>6} {status} {mid[:80]}")
else:
# Full extract (new-only nove, sync nove, full vse)
# Pro sync nove zpravy potrebujeme plny fetch
if is_sync:
full_url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{msg['id']}"
full_params = {"$select": MSG_SELECT, "$expand": ATT_EXPAND}
try:
msg = graph_get(full_url, full_params)
except Exception as e:
logging.error("full fetch failed [%s]: %s", msg.get("id","?"), e)
err_count += 1
continue
doc = extract_message(msg, folder["path"])
if doc is None:
err_count += 1
status = "ERR "
print(f" {total_i:>6} {status} {mid[:80]}")
else:
batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
ok_count += 1
status = "OK "
subject_str = (doc.get("subject") or "")[:60]
sender_str = (doc.get("sender", {}).get("email") or "")[:40]
print(f" {total_i:>6} {status} {subject_str:<60} {sender_str}")
if len(batch) >= BATCH_SIZE:
flush()
if total_i % 500 == 0:
elapsed = (datetime.now() - start).total_seconds()
rate = total_i / elapsed if elapsed > 0 else 0
print(f" {''*80}")
print(f" Průběh: ok={ok_count} sync={sync_count} skip={skip_count} err={err_count} {rate:.1f} msg/s")
print(f" {''*80}")
flush()
print(f"{folder_count} zprav ze slozky {folder['path']}")
if args.limit and total_i >= args.limit:
break
elapsed_total = (datetime.now() - start).total_seconds()
print(f"\n{'='*52}")
print(f"Vysledek: ok={ok_count} | sync={sync_count} | skip={skip_count} | err={err_count}")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"Dokumentu v kolekci: {col.count_documents({})}")
if not args.no_indexes:
print()
create_indexes(col)
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()