2026-06-02 17:20:20 +02:00
parent ec187e673a
commit b433ef0446
58 changed files with 9247 additions and 0 deletions
@@ -0,0 +1,43 @@
"Protocol","Study Population","Country","Site","Principal Investigator","Participant ID","Baseline Stool Frequency","Visit","Visit Date","Endoscopy Completed?","Endoscopy Date","Bowel Preparation Start Date 1","Bowel Preparation End Date 1","Bowel Preparation Start Date 2","Bowel Preparation End Date 2","Central Endoscopy Score","Local Endoscopy Score","PGA Score","Eligible Day (-1)","Day (-1) Excluded Reason(s)","Eligible Day (-2)","Day (-2) Excluded Reason(s)","Eligible Day (-3)","Day (-3) Excluded Reason(s)","Eligible Day (-4)","Day (-4) Excluded Reason(s)","Eligible Day (-5)","Day (-5) Excluded Reason(s)","Eligible Day (-6)","Day (-6) Excluded Reason(s)","Eligible Day (-7)","Day (-7) Excluded Reason(s)","Eligible Day (-8)","Day (-8) Excluded Reason(s)","Eligible Day (-9)","Day (-9) Excluded Reason(s)","Eligible Day (-10)","Day (-10) Excluded Reason(s)","Eligible Day (-1) Stool Count","Eligible Day (-2) Stool Count","Eligible Day (-3) Stool Count","Eligible Day (-4) Stool Count","Eligible Day (-5) Stool Count","Eligible Day (-6) Stool Count","Eligible Day (-7) Stool Count","Eligible Day (-8) Stool Count","Eligible Day (-9) Stool Count","Eligible Day (-10) Stool Count","Stool Frequency Sub-score","Eligible Day (-1) Rectal Bleeding Score","Eligible Day (-2) Rectal Bleeding Score","Eligible Day (-3) Rectal Bleeding Score","Eligible Day (-4) Rectal Bleeding Score","Eligible Day (-5) Rectal Bleeding Score","Eligible Day (-6) Rectal Bleeding Score","Eligible Day (-7) Rectal Bleeding Score","Eligible Day (-8) Rectal Bleeding Score","Eligible Day (-9) Rectal Bleeding Score","Eligible Day (-10) Rectal Bleeding Score","Rectal Bleeding Sub-score","Partial Mayo Score","Modified Mayo Score","Full Mayo Score","Site Action","Last Mayo Score Submission","Week I-12 Clinical Responder","Week I-12 Clinical Remission","Clinical Flare","Loss of Response","Partial Mayo Response Post Loss of Response","Partial Mayo Response for Clinical Non-Responders"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-0","19 Feb 2026","Yes","05 Feb 2026","04 Feb 2026","04 Feb 2026","-","-","2","-","3","18 Feb 2026","-","17 Feb 2026","-","16 Feb 2026","-","15 Feb 2026","-","14 Feb 2026","-","13 Feb 2026","-","12 Feb 2026","-","11 Feb 2026","Day Not Applicable for Calculation","10 Feb 2026","Day Not Applicable for Calculation","09 Feb 2026","Day Not Applicable for Calculation","10","8","7","5","7","8","8","-","-","-","3","1","1","1","0","1","1","1","-","-","-","1","7","6","9","-","08 Apr 2026 07:11:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-2","04 Mar 2026","-","-","-","-","-","-","-","-","3","03 Mar 2026","-","02 Mar 2026","-","01 Mar 2026","-","28 Feb 2026","-","27 Feb 2026","-","26 Feb 2026","-","25 Feb 2026","-","24 Feb 2026","Day Not Applicable for Calculation","23 Feb 2026","Day Not Applicable for Calculation","22 Feb 2026","Day Not Applicable for Calculation","5","4","5","4","5","6","6","-","-","-","2","1","0","1","0","1","0","1","-","-","-","1","6","","","-","28 May 2026 10:04:05","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-4","18 Mar 2026","-","-","-","-","-","-","-","-","2","17 Mar 2026","-","16 Mar 2026","-","15 Mar 2026","-","14 Mar 2026","-","13 Mar 2026","-","12 Mar 2026","-","11 Mar 2026","-","10 Mar 2026","Day Not Applicable for Calculation","09 Mar 2026","Day Not Applicable for Calculation","08 Mar 2026","Day Not Applicable for Calculation","5","5","5","4","5","4","5","-","-","-","2","1","0","0","1","1","1","0","-","-","-","1","5","","","-","08 Apr 2026 07:11:43","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-8","05 May 2026","-","-","-","-","-","-","-","-","1","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","3","3","4","4","5","4","4","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","4","","","-","28 May 2026 14:42:53","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-12","13 May 2026","Yes","06 May 2026","05 May 2026","05 May 2026","-","-","1","-","1","12 May 2026","-","11 May 2026","-","10 May 2026","-","09 May 2026","-","08 May 2026","-","07 May 2026","-","06 May 2026","Endoscopy","05 May 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","04 May 2026","-","03 May 2026","Day Not Applicable for Calculation","5","4","6","5","5","5","-","-","3","-","2","1","0","1","1","1","1","-","-","1","-","1","4","4","5","-","28 May 2026 14:43:11","Clinical Responder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-0","08 Apr 2026","Yes","18 Mar 2026","17 Mar 2026","18 Mar 2026","-","-","2","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","Missing Diary","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","3","3","4","-","3","3","4","-","-","-","1","0","0","0","-","0","0","1","-","-","-","0","3","3","5","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-2","23 Apr 2026","-","-","-","-","-","-","-","-","2","22 Apr 2026","Missing Diary","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","Day Not Applicable for Calculation","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","-","3","3","6","5","5","4","-","-","-","2","-","0","0","1","1","1","1","-","-","-","1","5","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-4","06 May 2026","-","-","-","-","-","-","-","-","1","05 May 2026","-","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","Day Not Applicable for Calculation","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","6","3","2","3","3","3","3","-","-","-","1","1","0","0","0","1","1","0","-","-","-","0","2","","","-","28 May 2026 14:43:38","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012003","1","I-0","27 May 2026","Yes","13 May 2026","12 May 2026","12 May 2026","-","-","3","-","2","26 May 2026","-","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","Day Not Applicable for Calculation","18 May 2026","Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","6","9","7","8","9","7","8","-","-","-","3","2","2","2","2","1","1","1","-","-","-","2","7","8","10","-","27 May 2026 07:24:39","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-0","20 Mar 2026","Yes","19 Feb 2026","-","-","-","-","3","-","3","19 Mar 2026","-","18 Mar 2026","-","17 Mar 2026","-","16 Mar 2026","-","15 Mar 2026","-","14 Mar 2026","-","13 Mar 2026","-","12 Mar 2026","Day Not Applicable for Calculation","11 Mar 2026","Day Not Applicable for Calculation","10 Mar 2026","Day Not Applicable for Calculation","7","7","8","8","7","8","5","-","-","-","3","2","1","1","1","1","1","0","-","-","-","1","7","7","10","-","20 Mar 2026 07:03:23","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-2","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","Medication For Diarrhea","06 Apr 2026","Medication For Diarrhea","05 Apr 2026","Medication For Diarrhea","04 Apr 2026","Medication For Diarrhea","03 Apr 2026","Medication For Diarrhea","02 Apr 2026","Medication For Diarrhea","01 Apr 2026","Medication For Diarrhea","31 Mar 2026","Medication For Diarrhea;Day Not Applicable for Calculation","30 Mar 2026","Medication For Diarrhea;Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","-","-","-","-","-","-","-","-","-","-","Non-Evaluable","-","-","-","-","-","-","-","-","-","-","Non-Evaluable","Non-Evaluable","Non-Evaluable","Non-Evaluable","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-4","15 Apr 2026","-","-","-","-","-","-","-","-","3","14 Apr 2026","-","13 Apr 2026","-","12 Apr 2026","-","11 Apr 2026","-","10 Apr 2026","-","09 Apr 2026","-","08 Apr 2026","-","07 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","06 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","05 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","9","22","20","19","17","18","18","-","-","-","3","1","3","2","2","2","2","2","-","-","-","2","8","","","-","04 May 2026 22:06:03","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-8","18 May 2026","-","-","-","-","-","-","-","-","2","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","-","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","08 May 2026","Day Not Applicable for Calculation","7","5","9","7","7","8","8","-","-","-","3","1","1","1","1","1","1","1","-","-","-","1","6","","","-","29 May 2026 15:44:46","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062002","1","I-0","26 May 2026","Yes","14 May 2026","13 May 2026","13 May 2026","-","-","2","-","2","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","-","18 May 2026","Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","16 May 2026","Day Not Applicable for Calculation","8","8","6","7","7","6","7","-","-","-","3","2","2","2","2","2","2","2","-","-","-","2","7","7","9","-","29 May 2026 15:45:00","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-0","05 May 2026","Yes","24 Apr 2026","23 Apr 2026","23 Apr 2026","-","-","2","-","2","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","5","5","5","5","5","5","5","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","5","5","7","-","05 May 2026 11:19:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-2","19 May 2026","-","-","-","-","-","-","-","-","1","18 May 2026","-","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","Day Not Applicable for Calculation","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","5","4","5","5","5","4","6","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","4","","","-","19 May 2026 10:38:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-0","07 Apr 2026","Yes","24 Mar 2026","22 Mar 2026","22 Mar 2026","-","-","2","-","2","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","-","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","28 Mar 2026","Day Not Applicable for Calculation","8","11","5","9","11","10","13","-","-","-","3","1","2","2","2","2","2","2","-","-","-","2","7","7","9","-","04 May 2026 08:44:52","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-2","22 Apr 2026","-","-","-","-","-","-","-","-","2","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","7","5","6","6","7","8","2","-","-","-","1","1","0","1","1","1","2","0","-","-","-","1","4","","","-","04 May 2026 08:45:07","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-4","07 May 2026","-","-","-","-","-","-","-","-","1","06 May 2026","-","05 May 2026","-","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","Day Not Applicable for Calculation","28 Apr 2026","Day Not Applicable for Calculation","27 Apr 2026","Day Not Applicable for Calculation","8","7","7","8","4","11","7","-","-","-","1","2","1","1","1","0","1","1","-","-","-","1","3","","","-","01 Jun 2026 00:57:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-0","24 Mar 2026","Yes","12 Mar 2026","11 Mar 2026","11 Mar 2026","-","-","2","-","2","23 Mar 2026","-","22 Mar 2026","-","21 Mar 2026","-","20 Mar 2026","-","19 Mar 2026","-","18 Mar 2026","-","17 Mar 2026","-","16 Mar 2026","Day Not Applicable for Calculation","15 Mar 2026","Day Not Applicable for Calculation","14 Mar 2026","Day Not Applicable for Calculation","8","6","5","7","6","7","6","-","-","-","3","1","1","1","0","1","1","1","-","-","-","1","6","6","8","-","05 Apr 2026 22:41:27","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-2","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","5","2","3","6","5","5","5","-","-","-","2","0","0","0","0","1","1","0","-","-","-","0","4","","","-","28 May 2026 23:19:03","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-4","21 Apr 2026","-","-","-","-","-","-","-","-","0","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","-","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","11 Apr 2026","Day Not Applicable for Calculation","4","3","4","3","3","4","4","-","-","-","2","0","0","0","0","0","0","0","-","-","-","0","2","","","-","27 May 2026 12:54:41","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","1","I-0","12 May 2026","Yes","21 Apr 2026","20 Apr 2026","21 Apr 2026","-","-","2","-","2","11 May 2026","-","10 May 2026","-","09 May 2026","-","08 May 2026","-","07 May 2026","-","06 May 2026","-","05 May 2026","Missing Diary","04 May 2026","Day Not Applicable for Calculation","03 May 2026","Day Not Applicable for Calculation","02 May 2026","Day Not Applicable for Calculation","2","1","1","1","1","2","-","-","-","-","0","0","0","0","0","0","0","-","-","-","-","0","2","2","4","-","28 May 2026 23:19:30","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","1","I-2","26 May 2026","-","-","-","-","-","-","-","-","1","25 May 2026","-","24 May 2026","Missing Diary","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","-","18 May 2026","Missing Diary;Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","16 May 2026","Day Not Applicable for Calculation","1","-","1","2","1","2","2","-","-","-","1","0","-","0","0","0","0","0","-","-","-","0","2","","","-","28 May 2026 23:19:51","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","0","I-0","02 Jun 2026","Yes","25 May 2026","24 May 2026","24 May 2026","-","-","2","-","2","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Endoscopy;Missing Diary;Day Not Applicable for Calculation","24 May 2026","Bowel Preparation for Procedure;Missing Diary;Day Not Applicable for Calculation","23 May 2026","Missing Diary;Day Not Applicable for Calculation","8","8","11","10","10","11","6","-","-","-","3","2","2","1","2","1","2","2","-","-","-","2","7","7","9","-","02 Jun 2026 08:17:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10016","Robert Mudr","CZ100162001","1","I-0","28 May 2026","Yes","19 May 2026","18 May 2026","19 May 2026","-","-","3","-","3","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","Day Not Applicable for Calculation","19 May 2026","Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation","18 May 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","14","15","15","15","15","15","15","-","-","-","3","2","3","3","2","2","3","3","-","-","-","3","9","9","12","-","28 May 2026 10:21:31","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","Unscheduled 1","04 May 2026","Yes","20 Apr 2026","12 Apr 2026","15 Apr 2026","-","-","2","-","3","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","-","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","24 Apr 2026","Day Not Applicable for Calculation","5","6","6","7","6","3","3","-","-","-","2","0","0","0","0","0","0","0","-","-","-","0","5","4","7","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","I-0","18 May 2026","Yes","01 May 2026","01 May 2026","01 May 2026","-","-","2","-","3","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","-","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","08 May 2026","Day Not Applicable for Calculation","6","6","6","6","6","6","6","-","-","-","3","0","0","0","0","0","0","0","-","-","-","0","6","5","8","-","18 May 2026 08:36:37","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","I-2","01 Jun 2026","-","-","-","-","-","-","-","-","3","31 May 2026","-","30 May 2026","Missing Diary","29 May 2026","Missing Diary","28 May 2026","Missing Diary","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","22 May 2026","Day Not Applicable for Calculation","6","-","-","-","6","6","6","-","-","-","3","0","-","-","-","0","0","0","-","-","-","0","6","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-0","07 Apr 2026","Yes","16 Mar 2026","15 Mar 2026","16 Mar 2026","-","-","3","-","3","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","-","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","28 Mar 2026","Day Not Applicable for Calculation","11","11","10","11","11","10","9","-","-","-","3","2","2","2","2","2","2","2","-","-","-","2","8","8","11","-","20 Apr 2026 09:27:58","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-2","20 Apr 2026","-","-","-","-","-","-","-","-","3","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","-","13 Apr 2026","-","12 Apr 2026","Day Not Applicable for Calculation","11 Apr 2026","Day Not Applicable for Calculation","10 Apr 2026","Day Not Applicable for Calculation","8","7","9","8","8","7","8","-","-","-","3","2","2","1","1","1","2","1","-","-","-","1","7","","","-","20 Apr 2026 09:29:01","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-4","05 May 2026","-","-","-","-","-","-","-","-","1","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","6","6","6","6","7","7","6","-","-","-","3","0","0","1","1","1","1","1","-","-","-","1","5","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222002","1","I-0","19 Feb 2026","Yes","11 Feb 2026","10 Feb 2026","11 Feb 2026","-","-","2","-","2","18 Feb 2026","-","17 Feb 2026","-","16 Feb 2026","-","15 Feb 2026","-","14 Feb 2026","-","13 Feb 2026","-","12 Feb 2026","-","11 Feb 2026","Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation","10 Feb 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","09 Feb 2026","Day Not Applicable for Calculation","3","2","2","3","4","3","2","-","-","-","1","1","1","0","0","0","2","2","-","-","-","1","4","4","6","-","19 Feb 2026 15:24:43","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-0","09 Mar 2026","Yes","11 Feb 2026","10 Feb 2026","11 Feb 2026","-","-","2","-","2","08 Mar 2026","-","07 Mar 2026","-","06 Mar 2026","-","05 Mar 2026","-","04 Mar 2026","-","03 Mar 2026","Missing Diary","02 Mar 2026","Missing Diary","01 Mar 2026","Missing Diary;Day Not Applicable for Calculation","28 Feb 2026","Missing Diary;Day Not Applicable for Calculation","27 Feb 2026","Missing Diary;Day Not Applicable for Calculation","7","7","6","6","7","-","-","-","-","-","3","2","2","2","2","2","-","-","-","-","-","2","7","7","9","-","27 Mar 2026 07:27:49","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-2","27 Mar 2026","-","-","-","-","-","-","-","-","2","26 Mar 2026","-","25 Mar 2026","-","24 Mar 2026","-","23 Mar 2026","-","22 Mar 2026","-","21 Mar 2026","-","20 Mar 2026","-","19 Mar 2026","Day Not Applicable for Calculation","18 Mar 2026","Day Not Applicable for Calculation","17 Mar 2026","Day Not Applicable for Calculation","7","3","3","3","5","5","5","-","-","-","2","0","0","1","1","1","1","2","-","-","-","1","5","","","-","08 Apr 2026 07:36:56","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-4","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","3","3","4","4","5","4","3","-","-","-","2","1","0","0","2","1","1","2","-","-","-","1","5","","","-","08 Apr 2026 07:59:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-8","04 May 2026","-","-","-","-","-","-","-","-","2","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","-","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","24 Apr 2026","Missing Diary;Day Not Applicable for Calculation","3","5","3","3","3","2","3","-","-","-","1","0","0","0","0","0","0","0","-","-","-","0","3","","","-","04 May 2026 08:08:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-12","01 Jun 2026","Yes","20 May 2026","19 May 2026","20 May 2026","-","-","3","-","2","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","22 May 2026","Day Not Applicable for Calculation","4","4","6","3","3","3","3","-","-","-","2","1","1","2","1","1","1","2","-","-","-","1","5","6","8","-","01 Jun 2026 14:25:57","Clinical Nonresponder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-0","09 Apr 2026","Yes","08 Apr 2026","31 Mar 2026","01 Apr 2026","-","-","2","-","2","08 Apr 2026","Endoscopy","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","31 Mar 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","30 Mar 2026","-","-","3","3","4","3","4","3","-","-","3","1","-","2","2","2","2","2","2","-","-","2","2","5","5","7","-","29 May 2026 11:07:08","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-2","22 Apr 2026","-","-","-","-","-","-","-","-","2","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","3","3","5","3","2","3","2","-","-","-","1","1","2","2","1","1","1","2","-","-","-","1","4","","","-","05 May 2026 15:00:39","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-4","05 May 2026","-","-","-","-","-","-","-","-","2","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","4","2","2","2","2","2","2","-","-","-","1","1","1","1","1","2","1","1","-","-","-","1","4","","","-","05 May 2026 07:30:02","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-8","02 Jun 2026","-","-","-","-","-","-","-","-","2","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","2","2","2","2","2","4","10","-","-","-","1","2","1","2","1","2","2","2","-","-","-","2","5","","","-","02 Jun 2026 08:19:16","N/A","N/A","N/A","N/A","N/A","N/A"
12 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-2 08 Apr 2026 - - - - - - - - 2 07 Apr 2026 Medication For Diarrhea 06 Apr 2026 Medication For Diarrhea 05 Apr 2026 Medication For Diarrhea 04 Apr 2026 Medication For Diarrhea 03 Apr 2026 Medication For Diarrhea 02 Apr 2026 Medication For Diarrhea 01 Apr 2026 Medication For Diarrhea 31 Mar 2026 Medication For Diarrhea;Day Not Applicable for Calculation 30 Mar 2026 Medication For Diarrhea;Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation - - - - - - - - - - Non-Evaluable - - - - - - - - - - Non-Evaluable Non-Evaluable Non-Evaluable Non-Evaluable - - N/A N/A N/A N/A N/A N/A
13 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-4 15 Apr 2026 - - - - - - - - 3 14 Apr 2026 - 13 Apr 2026 - 12 Apr 2026 - 11 Apr 2026 - 10 Apr 2026 - 09 Apr 2026 - 08 Apr 2026 - 07 Apr 2026 Medication For Diarrhea;Day Not Applicable for Calculation 06 Apr 2026 Medication For Diarrhea;Day Not Applicable for Calculation 05 Apr 2026 Medication For Diarrhea;Day Not Applicable for Calculation 9 22 20 19 17 18 18 - - - 3 1 3 2 2 2 2 2 - - - 2 8 - 04 May 2026 22:06:03 N/A N/A N/A N/A N/A N/A
14 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-8 18 May 2026 - - - - - - - - 2 17 May 2026 - 16 May 2026 - 15 May 2026 - 14 May 2026 - 13 May 2026 - 12 May 2026 - 11 May 2026 - 10 May 2026 Day Not Applicable for Calculation 09 May 2026 Day Not Applicable for Calculation 08 May 2026 Day Not Applicable for Calculation 7 5 9 7 7 8 8 - - - 3 1 1 1 1 1 1 1 - - - 1 6 - 29 May 2026 15:44:46 N/A N/A N/A N/A N/A N/A
15 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062002 1 I-0 26 May 2026 Yes 14 May 2026 13 May 2026 13 May 2026 - - 2 - 2 25 May 2026 - 24 May 2026 - 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 - 19 May 2026 - 18 May 2026 Day Not Applicable for Calculation 17 May 2026 Day Not Applicable for Calculation 16 May 2026 Day Not Applicable for Calculation 8 8 6 7 7 6 7 - - - 3 2 2 2 2 2 2 2 - - - 2 7 7 9 - 29 May 2026 15:45:00 N/A N/A N/A N/A N/A N/A
16 77242113UCO3001 Adult Czech Republic DD5-CZ10009 Jiri Pumprla CZ100092001 1 I-0 05 May 2026 Yes 24 Apr 2026 23 Apr 2026 23 Apr 2026 - - 2 - 2 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 Day Not Applicable for Calculation 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 5 5 5 5 5 5 5 - - - 2 1 1 1 1 1 1 1 - - - 1 5 5 7 - 05 May 2026 11:19:40 N/A N/A N/A N/A N/A N/A
17 77242113UCO3001 Adult Czech Republic DD5-CZ10009 Jiri Pumprla CZ100092001 1 I-2 19 May 2026 - - - - - - - - 1 18 May 2026 - 17 May 2026 - 16 May 2026 - 15 May 2026 - 14 May 2026 - 13 May 2026 - 12 May 2026 - 11 May 2026 Day Not Applicable for Calculation 10 May 2026 Day Not Applicable for Calculation 09 May 2026 Day Not Applicable for Calculation 5 4 5 5 5 4 6 - - - 2 1 1 1 1 1 1 1 - - - 1 4 - 19 May 2026 10:38:25 N/A N/A N/A N/A N/A N/A
18 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-0 07 Apr 2026 Yes 24 Mar 2026 22 Mar 2026 22 Mar 2026 - - 2 - 2 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 - 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 28 Mar 2026 Day Not Applicable for Calculation 8 11 5 9 11 10 13 - - - 3 1 2 2 2 2 2 2 - - - 2 7 7 9 - 04 May 2026 08:44:52 N/A N/A N/A N/A N/A N/A
19 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-2 22 Apr 2026 - - - - - - - - 2 21 Apr 2026 - 20 Apr 2026 - 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 Day Not Applicable for Calculation 13 Apr 2026 Day Not Applicable for Calculation 12 Apr 2026 Day Not Applicable for Calculation 7 5 6 6 7 8 2 - - - 1 1 0 1 1 1 2 0 - - - 1 4 - 04 May 2026 08:45:07 N/A N/A N/A N/A N/A N/A
20 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-4 07 May 2026 - - - - - - - - 1 06 May 2026 - 05 May 2026 - 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 Day Not Applicable for Calculation 28 Apr 2026 Day Not Applicable for Calculation 27 Apr 2026 Day Not Applicable for Calculation 8 7 7 8 4 11 7 - - - 1 2 1 1 1 0 1 1 - - - 1 3 - 01 Jun 2026 00:57:35 N/A N/A N/A N/A N/A N/A
21 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132001 1 I-0 24 Mar 2026 Yes 12 Mar 2026 11 Mar 2026 11 Mar 2026 - - 2 - 2 23 Mar 2026 - 22 Mar 2026 - 21 Mar 2026 - 20 Mar 2026 - 19 Mar 2026 - 18 Mar 2026 - 17 Mar 2026 - 16 Mar 2026 Day Not Applicable for Calculation 15 Mar 2026 Day Not Applicable for Calculation 14 Mar 2026 Day Not Applicable for Calculation 8 6 5 7 6 7 6 - - - 3 1 1 1 0 1 1 1 - - - 1 6 6 8 - 05 Apr 2026 22:41:27 N/A N/A N/A N/A N/A N/A
22 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132001 1 I-2 08 Apr 2026 - - - - - - - - 2 07 Apr 2026 - 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 Day Not Applicable for Calculation 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 5 2 3 6 5 5 5 - - - 2 0 0 0 0 1 1 0 - - - 0 4 - 28 May 2026 23:19:03 N/A N/A N/A N/A N/A N/A
23 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132001 1 I-4 21 Apr 2026 - - - - - - - - 0 20 Apr 2026 - 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 - 13 Apr 2026 Day Not Applicable for Calculation 12 Apr 2026 Day Not Applicable for Calculation 11 Apr 2026 Day Not Applicable for Calculation 4 3 4 3 3 4 4 - - - 2 0 0 0 0 0 0 0 - - - 0 2 - 27 May 2026 12:54:41 N/A N/A N/A N/A N/A N/A
24 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132002 1 I-0 12 May 2026 Yes 21 Apr 2026 20 Apr 2026 21 Apr 2026 - - 2 - 2 11 May 2026 - 10 May 2026 - 09 May 2026 - 08 May 2026 - 07 May 2026 - 06 May 2026 - 05 May 2026 Missing Diary 04 May 2026 Day Not Applicable for Calculation 03 May 2026 Day Not Applicable for Calculation 02 May 2026 Day Not Applicable for Calculation 2 1 1 1 1 2 - - - - 0 0 0 0 0 0 0 - - - - 0 2 2 4 - 28 May 2026 23:19:30 N/A N/A N/A N/A N/A N/A
25 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132002 1 I-2 26 May 2026 - - - - - - - - 1 25 May 2026 - 24 May 2026 Missing Diary 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 - 19 May 2026 - 18 May 2026 Missing Diary;Day Not Applicable for Calculation 17 May 2026 Day Not Applicable for Calculation 16 May 2026 Day Not Applicable for Calculation 1 - 1 2 1 2 2 - - - 1 0 - 0 0 0 0 0 - - - 0 2 - 28 May 2026 23:19:51 N/A N/A N/A N/A N/A N/A
26 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132003 0 I-0 02 Jun 2026 Yes 25 May 2026 24 May 2026 24 May 2026 - - 2 - 2 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 Endoscopy;Missing Diary;Day Not Applicable for Calculation 24 May 2026 Bowel Preparation for Procedure;Missing Diary;Day Not Applicable for Calculation 23 May 2026 Missing Diary;Day Not Applicable for Calculation 8 8 11 10 10 11 6 - - - 3 2 2 1 2 1 2 2 - - - 2 7 7 9 - 02 Jun 2026 08:17:40 N/A N/A N/A N/A N/A N/A
27 77242113UCO3001 Adult Czech Republic DD5-CZ10016 Robert Mudr CZ100162001 1 I-0 28 May 2026 Yes 19 May 2026 18 May 2026 19 May 2026 - - 3 - 3 27 May 2026 - 26 May 2026 - 25 May 2026 - 24 May 2026 - 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 Day Not Applicable for Calculation 19 May 2026 Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation 18 May 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 14 15 15 15 15 15 15 - - - 3 2 3 3 2 2 3 3 - - - 3 9 9 12 - 28 May 2026 10:21:31 N/A N/A N/A N/A N/A N/A
28 77242113UCO3001 Adolescent Czech Republic DD5-CZ10020 Lucie Gonsorcikova CZ100201001 1 Unscheduled 1 04 May 2026 Yes 20 Apr 2026 12 Apr 2026 15 Apr 2026 - - 2 - 3 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 - 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 24 Apr 2026 Day Not Applicable for Calculation 5 6 6 7 6 3 3 - - - 2 0 0 0 0 0 0 0 - - - 0 5 4 7 - - N/A N/A N/A N/A N/A N/A
29 77242113UCO3001 Adolescent Czech Republic DD5-CZ10020 Lucie Gonsorcikova CZ100201001 1 I-0 18 May 2026 Yes 01 May 2026 01 May 2026 01 May 2026 - - 2 - 3 17 May 2026 - 16 May 2026 - 15 May 2026 - 14 May 2026 - 13 May 2026 - 12 May 2026 - 11 May 2026 - 10 May 2026 Day Not Applicable for Calculation 09 May 2026 Day Not Applicable for Calculation 08 May 2026 Day Not Applicable for Calculation 6 6 6 6 6 6 6 - - - 3 0 0 0 0 0 0 0 - - - 0 6 5 8 - 18 May 2026 08:36:37 N/A N/A N/A N/A N/A N/A
30 77242113UCO3001 Adolescent Czech Republic DD5-CZ10020 Lucie Gonsorcikova CZ100201001 1 I-2 01 Jun 2026 - - - - - - - - 3 31 May 2026 - 30 May 2026 Missing Diary 29 May 2026 Missing Diary 28 May 2026 Missing Diary 27 May 2026 - 26 May 2026 - 25 May 2026 - 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 22 May 2026 Day Not Applicable for Calculation 6 - - - 6 6 6 - - - 3 0 - - - 0 0 0 - - - 0 6 - - N/A N/A N/A N/A N/A N/A
31 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-0 07 Apr 2026 Yes 16 Mar 2026 15 Mar 2026 16 Mar 2026 - - 3 - 3 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 - 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 28 Mar 2026 Day Not Applicable for Calculation 11 11 10 11 11 10 9 - - - 3 2 2 2 2 2 2 2 - - - 2 8 8 11 - 20 Apr 2026 09:27:58 N/A N/A N/A N/A N/A N/A
32 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-2 20 Apr 2026 - - - - - - - - 3 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 - 13 Apr 2026 - 12 Apr 2026 Day Not Applicable for Calculation 11 Apr 2026 Day Not Applicable for Calculation 10 Apr 2026 Day Not Applicable for Calculation 8 7 9 8 8 7 8 - - - 3 2 2 1 1 1 2 1 - - - 1 7 - 20 Apr 2026 09:29:01 N/A N/A N/A N/A N/A N/A
33 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-4 05 May 2026 - - - - - - - - 1 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 Day Not Applicable for Calculation 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 6 6 6 6 7 7 6 - - - 3 0 0 1 1 1 1 1 - - - 1 5 - - N/A N/A N/A N/A N/A N/A
34 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222002 1 I-0 19 Feb 2026 Yes 11 Feb 2026 10 Feb 2026 11 Feb 2026 - - 2 - 2 18 Feb 2026 - 17 Feb 2026 - 16 Feb 2026 - 15 Feb 2026 - 14 Feb 2026 - 13 Feb 2026 - 12 Feb 2026 - 11 Feb 2026 Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation 10 Feb 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 09 Feb 2026 Day Not Applicable for Calculation 3 2 2 3 4 3 2 - - - 1 1 1 0 0 0 2 2 - - - 1 4 4 6 - 19 Feb 2026 15:24:43 N/A N/A N/A N/A N/A N/A
35 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-0 09 Mar 2026 Yes 11 Feb 2026 10 Feb 2026 11 Feb 2026 - - 2 - 2 08 Mar 2026 - 07 Mar 2026 - 06 Mar 2026 - 05 Mar 2026 - 04 Mar 2026 - 03 Mar 2026 Missing Diary 02 Mar 2026 Missing Diary 01 Mar 2026 Missing Diary;Day Not Applicable for Calculation 28 Feb 2026 Missing Diary;Day Not Applicable for Calculation 27 Feb 2026 Missing Diary;Day Not Applicable for Calculation 7 7 6 6 7 - - - - - 3 2 2 2 2 2 - - - - - 2 7 7 9 - 27 Mar 2026 07:27:49 N/A N/A N/A N/A N/A N/A
36 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-2 27 Mar 2026 - - - - - - - - 2 26 Mar 2026 - 25 Mar 2026 - 24 Mar 2026 - 23 Mar 2026 - 22 Mar 2026 - 21 Mar 2026 - 20 Mar 2026 - 19 Mar 2026 Day Not Applicable for Calculation 18 Mar 2026 Day Not Applicable for Calculation 17 Mar 2026 Day Not Applicable for Calculation 7 3 3 3 5 5 5 - - - 2 0 0 1 1 1 1 2 - - - 1 5 - 08 Apr 2026 07:36:56 N/A N/A N/A N/A N/A N/A
37 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-4 08 Apr 2026 - - - - - - - - 2 07 Apr 2026 - 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 Day Not Applicable for Calculation 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 3 3 4 4 5 4 3 - - - 2 1 0 0 2 1 1 2 - - - 1 5 - 08 Apr 2026 07:59:35 N/A N/A N/A N/A N/A N/A
38 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-8 04 May 2026 - - - - - - - - 2 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 - 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 24 Apr 2026 Missing Diary;Day Not Applicable for Calculation 3 5 3 3 3 2 3 - - - 1 0 0 0 0 0 0 0 - - - 0 3 - 04 May 2026 08:08:40 N/A N/A N/A N/A N/A N/A
39 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-12 01 Jun 2026 Yes 20 May 2026 19 May 2026 20 May 2026 - - 3 - 2 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 - 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 22 May 2026 Day Not Applicable for Calculation 4 4 6 3 3 3 3 - - - 2 1 1 2 1 1 1 2 - - - 1 5 6 8 - 01 Jun 2026 14:25:57 Clinical Nonresponder No N/A N/A N/A N/A
40 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-0 09 Apr 2026 Yes 08 Apr 2026 31 Mar 2026 01 Apr 2026 - - 2 - 2 08 Apr 2026 Endoscopy 07 Apr 2026 - 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 31 Mar 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 30 Mar 2026 - - 3 3 4 3 4 3 - - 3 1 - 2 2 2 2 2 2 - - 2 2 5 5 7 - 29 May 2026 11:07:08 N/A N/A N/A N/A N/A N/A
41 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-2 22 Apr 2026 - - - - - - - - 2 21 Apr 2026 - 20 Apr 2026 - 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 Day Not Applicable for Calculation 13 Apr 2026 Day Not Applicable for Calculation 12 Apr 2026 Day Not Applicable for Calculation 3 3 5 3 2 3 2 - - - 1 1 2 2 1 1 1 2 - - - 1 4 - 05 May 2026 15:00:39 N/A N/A N/A N/A N/A N/A
42 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-4 05 May 2026 - - - - - - - - 2 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 Day Not Applicable for Calculation 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 4 2 2 2 2 2 2 - - - 1 1 1 1 1 2 1 1 - - - 1 4 - 05 May 2026 07:30:02 N/A N/A N/A N/A N/A N/A
43 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-8 02 Jun 2026 - - - - - - - - 2 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 Day Not Applicable for Calculation 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 2 2 2 2 2 4 10 - - - 1 2 1 2 1 2 2 2 - - - 2 5 - 02 Jun 2026 08:19:16 N/A N/A N/A N/A N/A N/A
@@ -0,0 +1,6 @@
"Protocol","Country","Site ID","PI_NAME","Subject Number","Age","Data Correction ID","Creation Date UTC","Status","Date of Last Action UTC","Total Open Period","Total Open Time (Days)","Current Status Time (Days)","Type","Next Action Required","Category","Query History","Reason for Change"
"77242113UCO3001_ANALYSIS","Czech Republic The","CZ10001","Falc, Matej","CZ100012001","48 Years","16923867","14-May-2026","Escalated","26-May-2026","8-14 Days","12","4","QUERY","Site","Patient","(3) 15 May 2026 Clario: You can upload scans of your paper ECGs using the Site Upload Tool. ---- Instructions can be found in the ""Reference Materials"" tab of the study portal. Please contact Customer Care for assistance if needed!","Data Checks"
"77242113UCO3001_ANALYSIS","Czech Republic The","CZ10001","Falc, Matej","CZ100012001","48 Years","16567067","22-Jan-2026","Resolved","28-Jan-2026","4-7 Days","4","","QUERY","","Patient","MD Falc","Data Checks"
"77242113UCO3001_ANALYSIS","Czech Republic The","CZ10009","Pumprla, Jiri","CZ100092001","49 Years","16776685","31-Mar-2026","Resolved","13-May-2026","Over 28 Days","29","","QUERY","","Patient","(2) 13 May 2026 Clario: I confirm, that only ONE ECG was collected by mistake.","Data Checks"
"77242113UCO3001_ANALYSIS","Czech Republic The","CZ10021","Bortlik, Martin","CZ100212001","61 Years","16717619","11-Mar-2026","Resolved","28-Apr-2026","Over 28 Days","32","","QUERY","","Patient","(2) 28 Apr 2026 Clario: I confirmed that due to technical problems, the ECG was done only twice","Data Checks"
"77242113UCO3001_ANALYSIS","Czech Republic The","CZ10022","Hrabak, Petr","CZ100222003","39 Years","16945114","21-May-2026","Escalated","27-May-2026","4-7 Days","7","3","DCR","Site","Patient","(6) 27 May 2026 Botdorf, Paul-Daniel: We still do not have any ECGs for any patients at your site with a collection Date/Time of 20-May-2026 at 14:19:34, 14:20:32, 14:21:15. Please review the records in the portal and let us know if anything more is needed. If you see these ECGs, please double check that this is actually the study they are currently in(77242113UCO3001_ANALYSIS).Thank you",""
@@ -0,0 +1,173 @@
"Protocol","Country","Site","PI Name","Subject ID","Age at Informed Consent","Baseline Stool Count","Confirm Baseline Stool Count","Data Correction ID","Creation Date UTC","Status","Description","Date of Last Action UTC","Total Open Period","Total Open Time (Days)","Current Status Time (Days)","Type","Next Action Required","Category","Query History","Reason for Change","Resolution"
"77242113UCO3001","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","48","1","","SW00703544","13-May-2026","Submitted","Please change answer to clinical remision from no to YES (week 12). Entry erros ","20-May-2026","8-14 Days","13","8","Query Active ","Site","New","(1) 20 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification Request.
For us to process your request, please let us know the name of the form (with date) with question.
Thank you. ERT/CLARIO Data Coordination Team
","Entry Error",""
"77242113UCO3001","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","79","1","","SW00696586","09-Apr-2026","ReadyForQC","Please correct date of endoscopy to date: 18 March 2026 (from 25 March 2026)","15-Apr-2026","Over 28 Days","35","31","Query Active ","Site","Site-Entered Data","","Entry Error","CLARIO RESOLUTION:
Part 1: In Mayo Subscore (1) dated 08 Apr 2026 for I-0 visit, CLARIO to make the following changes:
- What was the date of endoscopy? (ENDODT1D): from 25 Mar 2026 to 18 Mar 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","19","1","","SW00704536","19-May-2026","ReadyForQC","Please change the endoscopy date to 19-FEB-2026. 06-MAR-2026 was entered in error. ","26-May-2026","8-14 Days","9","4","Query Active ","Site","Site-Entered Data","","Entry Error","CLARIO RESOLUTION:
Part 1: In Mayo Subscore (1) dated 20 Mar 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 06 Mar 2026 to 19 Feb 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","22","5","Yes, I confirm this is the correct stool count.","SW00706684","01-Jun-2026","Submitted","The right endoscopy date is 23MAR2026, please change the date","01-Jun-2026","1 Day","1","1","","Clario DM","New","","Entry Error",""
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","29","1","","SW00705646","26-May-2026","Submitted","Correct visit date I-O is 12-May-2026. All questionaries were filled on paper and entered in tablet later.
Log-in issue. ","01-Jun-2026","4-7 Days","5","1","","Clario DM","New","(1) 01 Jun 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please provide the timestamps for each of the assessments if you used paper forms and transcribed into the device.
If unknown, ERT will use a dummy timestamp.
Thank you. ERT/CLARIO Data Coordination Team.
(2) 01 Jun 2026 dstepek@vnbrno.cz (Site User): time is unknown
","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","49","0","","SW00706581","29-May-2026","Submitted","baseline stool count reported by subject is 0, please change to 1 as per CRA request (subject has 1 stool in 2-3 days if in remission)","29-May-2026","1 Day","1","1","","Clario DM","New","","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10016","Robert Mudr","CZ100162001","48","1","","SW00705916","27-May-2026","Submitted","As per ATS investigation (ATS26040111), please remove the below form which was entered as a duplicate
- MAYO Diary (5) 24 Apr 2026","27-May-2026","4-7 Days","4","4","","Clario DM","New","","Technical Revision - Other",""
"77242113UCO3001","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","15","1","","SW00701729","06-May-2026","Completed","Dears, please delete data from visit I-0 (reported as 4th of May 2026) as this visit had to be postponed - see the previous DCR of this patient and change data request that was corrected. Patient has left the site before it was resolved and and new date of I-0 was planned. Patient continues to fill in his diary and patient is coming to I=0 visit within allowed window. We need the system and tablet to be ready to run new Mayo Score Report with updated and recent data (e.g. reflect new I-0 visit date, new eligible days -1 to -7.).
thank you, Jiri Skopek","19-May-2026","8-14 Days","8","","","","Visit Data","(1) 11 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please note that the delete forms are allowed if the reason is one of the following.
If not, forms will move to unscheduled visit.
Data collected by the wrong patient.
Data collected by someone other than the patient.
Data collected prior to informed consent, or after withdrawal from the study.
Duplicate data erroneously entered at an Unscheduled visit via paper transcription.
Data collected that is not expected per protocol.
Also, I-0 visit is still ongoing. Please close the visit.
Once the visit is closed, we will process accordingly.
Thank you. ERT/CLARIO Data Coordination Team
(2) 11 May 2026 jskopek (Site User): Dears,
I do not see any option that is adequate -from the list. Data are not needed to be deleted fully, they reflect the situation at May4th. Please mark it as unscheduled visit - as exactly that is the case. We need the system to be ready for I-0 visit planned for next week.
I will close the visit tomorrow - do you mean in tablet/ipad?
Thank you very much for your help! Jiri
(3) 12 May 2026 venkata.ramana (Clario): Thank you for your response.
Please note that the visit I-0 was still ongoing but not closed yet.
So please close the visit.
Kind Regards, Clario Data Coordination Team.
(4) 12 May 2026 jskopek (Site User): If I try to close the I-0 visit in the TABLET, it asks me whether the patient fulfils the eligibility criteria to proceed to the next visit based on these old data. If I answer NO, it asks me to DEACTIVATE the patient. I do not want to DEACTIVATE the patient. Can you explain WHERE and HOW to close this visit so that you can change it to UNSCHEDULED without deactivating the patient?
Thank you Jiri
","Other-delete visit I-0","CLARIO RESOLUTION:
Part 1: In the following forms dated 04 May 2026, CLARIO to make the following changes:
-Event ID: from I-0 to Unscheduled Visit 1
-Event At Entry: from I-0 to Unscheduled Visit 1
+Visit Start (49)
+ePRO Availability (1)
+Mayo Subscore (1)
+PGA (1)
Part 2: CLARIO to delete the following forms dated 04 May 2026 for I-0 visit.
+C-SSRS Since Last Visit (1)
+C-SSRS Since Last Visit Findings Report (1)
Part 3: CLARIO to manually enter Visit End form for Unscheduled visit 1 with the following information:
-Protocol: 77242113UCO3001
-Report Date: 04 May 2026
-Report Start Date and Time: 04 May 2026 23:59:59
-Event ID: Unscheduled Visit 1
-Event End Date: 04 May 2026 23:59:59
-Visit Status: Incomplete
-Phase At Entry: Screening
-Phase At Entry Timestamp: 13 Apr 2026 12:32:20
-Event At Entry: Unscheduled visit 1
-Event Start Date: 04 May 2026 23:59:59
-Event Time Zone Offset in Milliseconds: 7200000
-Session Repeat Number (SESREP1N): 0
-Session Instance Id (SESINST1S): 3f1214f0-4788-11f1-a0cf-bb403212adce
"
"77242113UCO3001","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","15","1","","SW00701226","04-May-2026","Completed","Dears, we would like ask you to change the information I read on assignment form given by patient on April 13, 2026 (Visit 1), Baseline Stool Count (PT.Custom4) as 3 that should be reported as 1.
The patient entered the wrong number because he did not understand that it should be the number of stools when the illness is in remission or absent. He is a child and did not interpret this question correctly. Therefore, please change Baseline Stool Count to 1.
Thank you, Jiri Skopek ","04-May-2026","1 Day","1","","","","Demographic","","Changed Information","(Clario instructions)
1. Please make below changes in the assignment form:
Baseline Stool Count (PT. Custom4): 03 to 01."
"77242113UCO3001","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","61","1","","SW00699492","23-Apr-2026","ReadyForQC","Please correct the date of endoscopy done during screening visit of patient CZ100212001 to correct date 16-MAR-2026.","29-Apr-2026","22-28 Days","26","22","Query Active ","Site","Site-Entered Data","","Changed Information","CLARIO RESOLUTION:
Part 1: In the Mayo Subscore (1) dated 07 Apr 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 24 Mar 2026 to 16 Mar 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","39","1","","SW00703322","12-May-2026","Completed","As per ATS investigation (ATS26040111), please remove the below form that's been entered as a duplicate
- MAYO Diary (16) - 18 Mar 2026
","20-May-2026","4-7 Days","6","","","","Technical Revision","","Technical Revision - Other","CLARIO RESOLUTION:
Part 1: CLARIO to delete the MAYO Diary (16) dated 18 Mar 2026.
"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","39","1","","SW00689748","09-Mar-2026","Completed","Dear all,
Patient CZ 100222003 was randomized on 9 Mar 2026. Kindly correct the colonoscopy date to 11 Feb 2025.
The date was initially entered as 21 Feb 2025 because the earlier date could not be entered in the system. The patient was rescreened.","02-Apr-2026","15-21 Days","17","","","","Site-Entered Data","(1) 13 Mar 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Could you please confirm whether you are requesting the following?
Mayo Subscore (1) dated 09 Mar 2026 for I-0 visit
-What was the date of endoscopy? (ENDODT1D): from 23 Feb 2026 to 11 Feb 2025
Could you please confirm the year? This subject was assigned on 02 Mar 2026, but you state that the correct date is 11 Feb 2025, which is a year earlier.
If you are not requesting above, please provide us the name of the form with question.
Thank you. ERT/CLARIO Data Coordination Team
(2) 13 Mar 2026 katerina.havlikova@clinoxus.com (Site User): confirm date of colonoscopy 11Feb2026
(3) 21 Mar 2026 msullivan (Clario): Dear Site,
The requested changes to the Mayo data have been updated. Please navigate to the Mayo Score Report and resubmit the form for visit to log the updated Mayo Score form. Once done, please respond to this query confirming that the Mayo Score has been resubmitted.
Thank you. ERT/CLARIO Data Coordination Team
(4) 24 Mar 2026 jana.pomahacova@clinoxus.com (Site User): Thank you and sent
","New Information","CLARIO RESOLUTION:
Part 1: In the Mayo Subscore (1) dated 09 Mar 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 23 Feb 2026 to 11 Feb 2025
-Data Flag (QSDFLG1B): from blank to check"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","33","1","","SW00705372","22-May-2026","Submitted","Dear all, please change Colonoscopz date from 8April2026 to date 01Apr2026 Thank you in advance","29-May-2026","4-7 Days","6","1","Query Active ","Site","New","(1) 29 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please provide us the name of the form for this request.
Thank you. ERT/CLARIO Data Coordination Team
","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","33","1","","SW00702538","08-May-2026","Completed","This TRR is to document the correction to the Mayo Subscore (1) form, where the following variables were populated with NULL values, due to a known core defect:
Event At Entry, Event Start Date, Event Time Zone Offset in Milliseconds.","12-May-2026","2-3 Days","2","","","","Technical Revision","","Technical Revision - Other","Please make the below changes in Mayo Subscore (1) dated 22 Apr 2026:
-Event At Entry: I-0
-Event Start Date: 09 Apr 2026 08:09:19
-Event Time Zone Offset in Milliseconds: 7200000"
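The export above stores multi-line text (Description, Query History, Resolution) inside quoted fields, so a record can span several physical lines. A minimal sketch of reading such a file with Python's standard `csv` module follows. It assumes the export is available as text; the function name `read_dcr_rows` and the four-column sample are illustrative only and are not part of the commit. The sample record reuses values from DCR SW00703322 above.

```python
import csv
import io
from datetime import datetime

# Abbreviated sample: four of the export's columns, one record taken from the rows above.
SAMPLE = (
    '"Data Correction ID","Creation Date UTC","Status","Resolution"\n'
    '"SW00703322","12-May-2026","Completed","CLARIO RESOLUTION:\n'
    'Part 1: CLARIO to delete the MAYO Diary (16) dated 18 Mar 2026.\n'
    '"\n'
)


def read_dcr_rows(text):
    """Parse the export text into a list of dicts, one per DCR record (hypothetical helper)."""
    # newline="" leaves line breaks inside quoted fields untouched for the csv module.
    reader = csv.DictReader(io.StringIO(text, newline=""))
    rows = []
    for row in reader:
        # Dates in the export look like 12-May-2026; %b assumes English month abbreviations.
        row["Creation Date UTC"] = datetime.strptime(
            row["Creation Date UTC"], "%d-%b-%Y"
        ).date()
        rows.append(row)
    return rows


rows = read_dcr_rows(SAMPLE)
print(len(rows), rows[0]["Data Correction ID"], rows[0]["Creation Date UTC"])
```

When reading from disk, the same idea applies by opening the file with `open(path, newline="", encoding=...)`. The encoding of the real export is not stated in the source and would need to be checked.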
@@ -0,0 +1,648 @@
"""
create_report.py
Version: 1.6
Date: 2026-06-02

Generates an Excel report (.xlsm) for study 77242113UCO3001 from the Clario MongoDB database.

Output: U:/Dropbox/!!!Days/Downloads Z230/YYYY-MM-DD 77242113UCO3001 Clario Reports.xlsm

Data source:
    MongoDB 192.168.1.76, database Clario
    Collection Clario.MayoScore: Mayo scores per patient × visit
    Collection Clario.MayoDiary: daily patient diary entries
    Collection Clario.eCOA_DCRs: eCOA data correction requests
    Collection Clario.ECG_DCRs: ECG data correction requests

Sheets:
    MayoScore: one row = patient × visit
        the "KLIKNI SEM" column navigates to the filtered EligibleDays sheet
        I-0 rows with Modified Mayo < 5 are shown in bold red
    MayoDiary: one row = one daily patient diary entry
    EligibleDays: one row = one eligible day from MayoScore, enriched with MayoDiary data;
        included/excluded flag, excluded days in grey on a yellow background
    eCOA_DCRs: all fields from the Clario.eCOA_DCRs collection
    ECG_DCRs: all fields from the Clario.ECG_DCRs collection

VBA macro (Worksheet_SelectionChange on the MayoScore sheet):
    Clicking the "KLIKNI SEM" column switches to EligibleDays and filters the records
    for the given patient and visit. Macros must be enabled when the file is opened.
"""
VERSION = "1.6"
from datetime import datetime
from pathlib import Path
import time
from pymongo import MongoClient
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
import xlwings as xw
# ---------------------------------------------------------------------------
# Configuration
# ---------------------------------------------------------------------------
MONGO_URI = "mongodb://192.168.1.76:27017"
DB_NAME = "Clario"
OUTPUT_DIR = Path(r"U:\Dropbox\!!!Days\Downloads Z230")
VISIT_ORDER = ["I-0", "I-2", "I-4", "I-8", "I-12"]
COLUMNS_SCORE = [
    ("KLIKNI SEM", lambda d: "▶ klikni sem"),
    ("Site", lambda d: d.get("site", {}).get("name", "")),
    ("Subject ID", lambda d: d.get("subject", {}).get("id", "")),
    ("Visit", lambda d: d["fields"].get("Visit", "")),
    ("Visit Date", lambda d: d["fields"].get("Visit Date", "")),
    ("Baseline Stool Frequency", lambda d: _num(d["fields"].get("Baseline Stool Frequency", ""))),
    ("Central Endoscopy Score", lambda d: _num(d["fields"].get("Central Endoscopy Score", ""))),
    ("PGA Score", lambda d: _num(d["fields"].get("PGA Score", ""))),
    ("Stool Frequency Sub-score", lambda d: _num(d["fields"].get("Stool Frequency Sub-score", ""))),
    ("Rectal Bleeding Sub-score", lambda d: _num(d["fields"].get("Rectal Bleeding Sub-score", ""))),
    ("Partial Mayo Score", lambda d: _num(d["fields"].get("Partial Mayo Score", ""))),
    ("Modified Mayo Score", lambda d: _num(d["fields"].get("Modified Mayo Score", ""))),
    ("Full Mayo Score", lambda d: _num(d["fields"].get("Full Mayo Score", ""))),
    ("Site Action", lambda d: d.get("Site Action") or ""),
    ("Last Mayo Score Submission", lambda d: d.get("Last Mayo Score Submission") or ""),
    ("Wk I-12 Responder", lambda d: d.get("Week I-12 Clinical Responder") or ""),
    ("Wk I-12 Remission", lambda d: d.get("Week I-12 Clinical Remission") or ""),
    ("Clinical Flare", lambda d: d.get("Clinical Flare") or ""),
    ("Loss of Response", lambda d: d.get("Loss of Response") or ""),
    ("Partial Mayo Post LoR", lambda d: d.get("Partial Mayo Response Post Loss of Response") or ""),
    ("Partial Mayo Non-Resp", lambda d: d.get("Partial Mayo Response for Clinical Non-Responders") or ""),
]
COLUMNS_DIARY = [
    ("Subject ID", lambda d: d.get("subject", {}).get("id", "")),
    ("Report Date", lambda d: d["fields"].get("Report Date", "")),
    ("Baseline Stool Count", lambda d: _num(d["fields"].get("Baseline Stool Count", ""))),
    ("Stool Frequency", lambda d: _num(d["fields"].get("Stool Frequency", ""))),
    ("MAYO050", lambda d: d["fields"].get("MAYO050", "")),
    ("Not Applicable", lambda d: d["fields"].get("Not Applicable", "")),
    ("Constipation", lambda d: d["fields"].get("Constipation", "")),
    ("Diarrhea", lambda d: d["fields"].get("Diarrhea", "")),
    ("Irregularity", lambda d: d["fields"].get("Irregularity", "")),
]
COLUMNS_ECOA_DCRS = [
    ("Site", lambda d: d.get("site", {}).get("name", "")),
    ("Subject ID", lambda d: d.get("subject", {}).get("id", "")),
    ("Data Correction ID", lambda d: d["fields"].get("Data Correction ID", "")),
    ("PI Name", lambda d: d["fields"].get("PI Name", "")),
    ("Creation Date UTC", lambda d: d["fields"].get("Creation Date UTC", "")),
    ("Date of Last Action UTC", lambda d: d["fields"].get("Date of Last Action UTC", "")),
    ("Status", lambda d: d["fields"].get("Status", "")),
    ("Type", lambda d: d["fields"].get("Type", "")),
    ("Next Action Required", lambda d: d["fields"].get("Next Action Required", "")),
    ("Category", lambda d: d["fields"].get("Category", "")),
    ("Total Open Period", lambda d: d["fields"].get("Total Open Period", "")),
    ("Total Open Time (Days)", lambda d: _num(d["fields"].get("Total Open Time (Days)", ""))),
    ("Current Status Time (Days)", lambda d: _num(d["fields"].get("Current Status Time (Days)", ""))),
    ("Reason for Change", lambda d: d["fields"].get("Reason for Change", "")),
    ("Description", lambda d: d["fields"].get("Description", "")),
    ("Resolution", lambda d: d["fields"].get("Resolution", "")),
    ("Query History", lambda d: d["fields"].get("Query History", "")),
    ("Age at Informed Consent", lambda d: d["fields"].get("Age at Informed Consent", "")),
    ("Baseline Stool Count", lambda d: _num(d["fields"].get("Baseline Stool Count", ""))),
    ("firstSeen", lambda d: d.get("firstSeen", "")),
    ("lastSeen", lambda d: d.get("lastSeen", "")),
]
COLUMNS_ECG_DCRS = [
    ("Site ID", lambda d: d.get("site", {}).get("name", "")),
    ("Subject Number", lambda d: d.get("subject", {}).get("id", "")),
    ("Data Correction ID", lambda d: d["fields"].get("Data Correction ID", "")),
    ("PI Name", lambda d: d["fields"].get("PI_NAME", "")),
    ("Age", lambda d: d["fields"].get("Age", "")),
    ("Creation Date UTC", lambda d: d["fields"].get("Creation Date UTC", "")),
    ("Date of Last Action UTC", lambda d: d["fields"].get("Date of Last Action UTC", "")),
    ("Status", lambda d: d["fields"].get("Status", "")),
    ("Type", lambda d: d["fields"].get("Type", "")),
    ("Next Action Required", lambda d: d["fields"].get("Next Action Required", "")),
    ("Category", lambda d: d["fields"].get("Category", "")),
    ("Total Open Period", lambda d: d["fields"].get("Total Open Period", "")),
    ("Total Open Time (Days)", lambda d: _num(d["fields"].get("Total Open Time (Days)", ""))),
    ("Current Status Time (Days)", lambda d: _num(d["fields"].get("Current Status Time (Days)", ""))),
    ("Reason for Change", lambda d: d["fields"].get("Reason for Change", "")),
    ("Query History", lambda d: d["fields"].get("Query History", "")),
    ("firstSeen", lambda d: d.get("firstSeen", "")),
    ("lastSeen", lambda d: d.get("lastSeen", "")),
]
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _num(value):
    """Convert a numeric string to int (or float); return None for empty input, otherwise the original value."""
    if value == "" or value is None:
        return None
    try:
        return int(value)
    except (ValueError, TypeError):
        try:
            return float(value)
        except (ValueError, TypeError):
            return value


def _visit_sort_key(doc):
    """Sort key: site, subject, position of the visit in VISIT_ORDER (unknown visits last), visit name."""
    visit = doc["fields"].get("Visit", "")
    try:
        idx = VISIT_ORDER.index(visit)
    except ValueError:
        idx = len(VISIT_ORDER)
    return (doc.get("site", {}).get("name", ""), doc.get("subject", {}).get("id", ""), idx, visit)


def _iso_to_date(value):
    """Convert an ISO string to a Python date for Excel; return anything else unchanged."""
    if not isinstance(value, str):
        return value
    try:
        return datetime.fromisoformat(value).date()
    except ValueError:
        return value
# ---------------------------------------------------------------------------
# Styles
# ---------------------------------------------------------------------------
HEADER_FILL = PatternFill("solid", fgColor="1F497D")
HEADER_FONT = Font(bold=True, color="FFFFFF", size=10)
CELL_FONT = Font(size=10)
ALIGN_CTR = Alignment(horizontal="center", vertical="center", wrap_text=False)
ALIGN_LEFT = Alignment(horizontal="left", vertical="center")
THIN = Side(style="thin", color="BFBFBF")
BORDER = Border(left=THIN, right=THIN, top=THIN, bottom=THIN)
# zebra striping
FILL_ODD = PatternFill("solid", fgColor="FFFFFF")
FILL_EVEN = PatternFill("solid", fgColor="EBF1DE")
# DCR status colours
FILL_DCR_SITE = PatternFill("solid", fgColor="FFFF00")  # yellow: waiting for the site physician
FILL_DCR_CLARIO = PatternFill("solid", fgColor="BDD7EE")  # blue: waiting for Clario
FILL_DCR_QC = PatternFill("solid", fgColor="F4B942")  # orange: ReadyForQC
FILL_DCR_DONE = PatternFill("solid", fgColor="FFFFFF")  # white: Completed
SCORE_COLS = {"Partial Mayo Score", "Modified Mayo Score", "Full Mayo Score"}
SCORE_FILL = PatternFill("solid", fgColor="FFC7CE")  # red for scores ≥ 5 (placeholder; conditional formatting is not used)
# ---------------------------------------------------------------------------
# Sheet construction
# ---------------------------------------------------------------------------


def _build_sheet(ws, docs, columns, date_cols, center_cols, col_widths, row_font_fn=None, wrap_cols=None, header_row=1):
    """Write the header row and one styled row per document, then set widths, date formats, freeze panes and the filter."""
    headers = [c[0] for c in columns]
    for col_idx, header in enumerate(headers, 1):
        cell = ws.cell(row=header_row, column=col_idx, value=header)
        cell.font = HEADER_FONT
        cell.fill = HEADER_FILL
        cell.alignment = ALIGN_CTR
        cell.border = BORDER
    ws.row_dimensions[header_row].height = 28
    data_start = header_row + 1
    for row_idx, doc in enumerate(docs, data_start):
        fill = FILL_EVEN if (row_idx - header_row) % 2 == 0 else FILL_ODD
        font = row_font_fn(doc) if row_font_fn else CELL_FONT
        for col_idx, (col_name, getter) in enumerate(columns, 1):
            value = getter(doc)
            if col_name in date_cols and isinstance(value, str):
                value = _iso_to_date(value)
            cell = ws.cell(row=row_idx, column=col_idx, value=value)
            cell.font = font
            cell.fill = fill
            cell.border = BORDER
            if wrap_cols and col_name in wrap_cols:
                cell.alignment = Alignment(horizontal="left", vertical="top", wrap_text=True)
            else:
                cell.alignment = ALIGN_CTR if col_name in center_cols else ALIGN_LEFT
    for col_idx, (col_name, _) in enumerate(columns, 1):
        ws.column_dimensions[get_column_letter(col_idx)].width = col_widths.get(col_name, 14)
    for col_name in date_cols:
        if col_name in headers:
            letter = get_column_letter(headers.index(col_name) + 1)
            for row_idx in range(data_start, len(docs) + data_start):
                ws[f"{letter}{row_idx}"].number_format = "DD-MMM-YYYY"
    ws.freeze_panes = f"A{data_start}"
    ws.auto_filter.ref = f"A{header_row}:{get_column_letter(len(headers))}{header_row}"


def _score_row_font(doc):
    """Bold red font for I-0 rows with Modified Mayo < 5, the default cell font otherwise."""
    visit = doc["fields"].get("Visit", "")
    try:
        mod_mayo = int(doc["fields"].get("Modified Mayo Score", ""))
    except (ValueError, TypeError):
        mod_mayo = None
    if visit == "I-0" and mod_mayo is not None and mod_mayo < 5:
        return Font(size=10, bold=True, color="FF0000")
    return CELL_FONT
def build_mayo_score_sheet(ws, docs):
    _build_sheet(
        ws, docs, COLUMNS_SCORE,
        date_cols={"Visit Date", "Last Mayo Score Submission"},
        center_cols={"KLIKNI SEM", "Visit", "Central Endoscopy Score", "PGA Score",
                     "Stool Frequency Sub-score", "Rectal Bleeding Sub-score",
                     "Partial Mayo Score", "Modified Mayo Score", "Full Mayo Score",
                     "Baseline Stool Frequency",
                     "Wk I-12 Responder", "Wk I-12 Remission", "Clinical Flare",
                     "Loss of Response", "Partial Mayo Post LoR", "Partial Mayo Non-Resp",
                     "Last Mayo Score Submission"},
        col_widths={
            "KLIKNI SEM": 14,
            "Site": 18, "Subject ID": 16, "Visit": 12, "Visit Date": 14,
            "Baseline Stool Frequency": 14, "Central Endoscopy Score": 14,
            "PGA Score": 10, "Stool Frequency Sub-score": 14,
            "Rectal Bleeding Sub-score": 14, "Partial Mayo Score": 14,
            "Modified Mayo Score": 14, "Full Mayo Score": 13,
            "Site Action": 22, "Last Mayo Score Submission": 16,
            "Wk I-12 Responder": 14, "Wk I-12 Remission": 14,
            "Clinical Flare": 14, "Loss of Response": 14,
            "Partial Mayo Post LoR": 20, "Partial Mayo Non-Resp": 20,
        },
        row_font_fn=_score_row_font,
    )
    # Special style for the KLIKNI SEM column so that it looks like a button/link
    link_font = Font(size=10, bold=True, color="FFFFFF")
    link_fill = PatternFill("solid", fgColor="2E75B6")
    for row in range(2, len(docs) + 2):
        cell = ws.cell(row=row, column=1)
        cell.font = link_font
        cell.fill = link_fill
        cell.alignment = ALIGN_CTR


def build_mayo_diary_sheet(ws, docs):
    _build_sheet(
        ws, docs, COLUMNS_DIARY,
        date_cols={"Report Date"},
        center_cols={"Baseline Stool Count", "Stool Frequency", "Not Applicable",
                     "Constipation", "Diarrhea", "Irregularity"},
        col_widths={
            "Subject ID": 16, "Report Date": 14, "Baseline Stool Count": 14,
            "Stool Frequency": 14, "MAYO050": 48, "Not Applicable": 14,
            "Constipation": 14, "Diarrhea": 12, "Irregularity": 14,
        },
    )
def build_eligible_days_sheet(ws, score_docs, diary_docs):
    # Lookup diary records by (subject_id, date_part YYYY-MM-DD)
    diary_lookup: dict[tuple, dict] = {}
    for d in diary_docs:
        subj = d.get("subject", {}).get("id", "")
        date_iso = d["fields"].get("Report Date", "")
        date_part = date_iso[:10] if date_iso else ""
        if subj and date_part:
            diary_lookup[(subj, date_part)] = d
    headers = [
        "Included", "Subject ID", "Visit", "Visit Date", "Day",
        "Report Date", "Baseline Stool Count", "Stool Frequency",
        "MAYO050", "Not Applicable", "Constipation", "Diarrhea", "Irregularity",
    ]
    col_widths = {
        "Included": 10, "Subject ID": 16, "Visit": 10, "Visit Date": 14, "Day": 8,
        "Report Date": 14, "Baseline Stool Count": 14, "Stool Frequency": 14,
        "MAYO050": 48, "Not Applicable": 14, "Constipation": 14,
        "Diarrhea": 12, "Irregularity": 14,
    }
    center_cols = {"Included", "Visit", "Day", "Baseline Stool Count", "Stool Frequency",
                   "Not Applicable", "Constipation", "Diarrhea", "Irregularity"}
    date_cols = {"Visit Date", "Report Date"}
    no_fill = PatternFill("solid", fgColor="FFF2CC")  # yellow for excluded days
    for col_idx, header in enumerate(headers, 1):
        cell = ws.cell(row=1, column=col_idx, value=header)
        cell.font = HEADER_FONT
        cell.fill = HEADER_FILL
        cell.alignment = ALIGN_CTR
        cell.border = BORDER
    ws.row_dimensions[1].height = 28
    row_idx = 2
    for score_doc in score_docs:
        subj = score_doc.get("subject", {}).get("id", "")
        visit = score_doc["fields"].get("Visit", "")
        visit_date = score_doc["fields"].get("Visit Date", "")
        # One output row per populated "Eligible Day (-n)" field, n = 1..10
        for n in range(1, 11):
            day_date_iso = score_doc["fields"].get(f"Eligible Day (-{n})")
            if not day_date_iso or day_date_iso == "-":
                continue
            date_part = day_date_iso[:10]
            excl_reason = score_doc["fields"].get(f"Day (-{n}) Excluded Reason(s)", "")
            included = "No" if excl_reason and excl_reason != "-" else "Yes"
            diary = diary_lookup.get((subj, date_part), {})
            df = diary.get("fields", {})
            fill = no_fill if included == "No" else (FILL_EVEN if row_idx % 2 == 0 else FILL_ODD)
            font = Font(size=10, color="808080") if included == "No" else CELL_FONT
            values = [
                included,
                subj,
                visit,
                _iso_to_date(visit_date) if isinstance(visit_date, str) else visit_date,
                f"-{n}",
                _iso_to_date(day_date_iso),
                _num(df.get("Baseline Stool Count", "")),
                _num(df.get("Stool Frequency", "")),
                df.get("MAYO050", ""),
                df.get("Not Applicable", ""),
                df.get("Constipation", ""),
                df.get("Diarrhea", ""),
                df.get("Irregularity", ""),
            ]
            for col_idx, (header, value) in enumerate(zip(headers, values), 1):
                cell = ws.cell(row=row_idx, column=col_idx, value=value)
                cell.font = font
                cell.fill = fill
                cell.border = BORDER
                if header in date_cols:
                    cell.number_format = "DD-MMM-YYYY"
                cell.alignment = ALIGN_CTR if header in center_cols else ALIGN_LEFT
            row_idx += 1
    for col_idx, header in enumerate(headers, 1):
        ws.column_dimensions[get_column_letter(col_idx)].width = col_widths.get(header, 14)
    ws.freeze_panes = "A2"
    ws.auto_filter.ref = f"A1:{get_column_letter(len(headers))}1"
def _build_dcr_legend(ws):
    """Insert the legend into rows 1-4 and leave row 5 empty. The header goes on row 6, data start on row 7."""
    legend = [
        (FILL_DCR_SITE, "Čeká lékař — Next Action Required = Site (lékař musí odpovědět nebo potvrdit)"),
        (FILL_DCR_CLARIO, "Čeká Clario — Next Action Required = Clario DM (Clario dostalo podklady, provede změnu)"),
        (FILL_DCR_QC, "ReadyForQC — Clario provedlo změny, čeká na finální QC kontrolu"),
        (FILL_DCR_DONE, "Completed / Resolved — DCR je uzavřen"),
    ]
    for i, (fill, text) in enumerate(legend, 1):
        a = ws.cell(row=i, column=1, value="")
        a.fill = fill
        a.border = BORDER
        b = ws.cell(row=i, column=2, value=text)
        b.font = Font(size=10, bold=True)
        b.alignment = ALIGN_LEFT
    # row 5 stays empty; nothing to do


def _dcr_row_fill(doc):
    """Return the fill colour that matches the DCR status."""
    status = doc["fields"].get("Status", "")
    next_action = doc["fields"].get("Next Action Required", "")
    if status in ("Completed", "Resolved"):
        return FILL_DCR_DONE
    if status == "ReadyForQC":
        return FILL_DCR_QC
    if "Site" in next_action:
        return FILL_DCR_SITE
    if "Clario" in next_action or next_action == "":
        return FILL_DCR_CLARIO
    return FILL_ODD
def build_ecoa_dcrs_sheet(ws, docs):
    _build_dcr_legend(ws)
    docs_sorted = sorted(docs, key=lambda d: (
        d.get("site", {}).get("name", ""),
        d.get("subject", {}).get("id", ""),
        d["fields"].get("Creation Date UTC", ""),
    ))
    _build_sheet(
        ws, docs_sorted, COLUMNS_ECOA_DCRS,
        date_cols={"Creation Date UTC", "Date of Last Action UTC"},
        center_cols={"Status", "Type", "Next Action Required", "Category",
                     "Total Open Time (Days)", "Current Status Time (Days)",
                     "Baseline Stool Count", "firstSeen", "lastSeen"},
        col_widths={
            "Site": 16, "Subject ID": 16, "Data Correction ID": 18,
            "PI Name": 18, "Creation Date UTC": 14, "Date of Last Action UTC": 14,
            "Status": 14, "Type": 16, "Next Action Required": 16, "Category": 20,
            "Total Open Period": 14, "Total Open Time (Days)": 14,
            "Current Status Time (Days)": 16, "Reason for Change": 20,
            "Description": 50, "Resolution": 50, "Query History": 60,
            "Age at Informed Consent": 14, "Baseline Stool Count": 14,
            "firstSeen": 12, "lastSeen": 12,
        },
        wrap_cols={"Reason for Change", "Description", "Resolution", "Query History"},
        header_row=6,
        row_font_fn=lambda doc: CELL_FONT,
    )
    # Recolour rows by DCR status (overrides the zebra fill)
    data_start = 7
    for row_idx, doc in enumerate(docs_sorted, data_start):
        fill = _dcr_row_fill(doc)
        for col_idx in range(1, len(COLUMNS_ECOA_DCRS) + 1):
            ws.cell(row=row_idx, column=col_idx).fill = fill


def build_ecg_dcrs_sheet(ws, docs):
    _build_dcr_legend(ws)
    docs_sorted = sorted(docs, key=lambda d: (
        d.get("site", {}).get("name", ""),
        d.get("subject", {}).get("id", ""),
        d["fields"].get("Creation Date UTC", ""),
    ))
    _build_sheet(
        ws, docs_sorted, COLUMNS_ECG_DCRS,
        date_cols={"Creation Date UTC", "Date of Last Action UTC"},
        center_cols={"Status", "Type", "Next Action Required", "Category",
                     "Total Open Time (Days)", "Current Status Time (Days)",
                     "firstSeen", "lastSeen"},
        col_widths={
            "Site ID": 14, "Subject Number": 16, "Data Correction ID": 16,
            "PI Name": 18, "Age": 10, "Creation Date UTC": 14,
            "Date of Last Action UTC": 14, "Status": 14, "Type": 12,
            "Next Action Required": 16, "Category": 14,
            "Total Open Period": 14, "Total Open Time (Days)": 14,
            "Current Status Time (Days)": 16, "Reason for Change": 20,
            "Query History": 60, "firstSeen": 12, "lastSeen": 12,
        },
        wrap_cols={"Query History"},
        header_row=6,
        row_font_fn=lambda doc: CELL_FONT,
    )
    # Recolour rows by DCR status (overrides the zebra fill)
    data_start = 7
    for row_idx, doc in enumerate(docs_sorted, data_start):
        fill = _dcr_row_fill(doc)
        for col_idx in range(1, len(COLUMNS_ECG_DCRS) + 1):
            ws.cell(row=row_idx, column=col_idx).fill = fill
# ---------------------------------------------------------------------------
# Helpers: output path
# ---------------------------------------------------------------------------
def _unique_path(directory: Path, stem: str, suffix: str) -> Path:
    """Return 'stem + suffix', or 'stem (2)', 'stem (3)', ... if that file already exists."""
    candidate = directory / f"{stem}{suffix}"
    if not candidate.exists():
        return candidate
    n = 2
    while True:
        candidate = directory / f"{stem} ({n}){suffix}"
        if not candidate.exists():
            return candidate
        n += 1
# ---------------------------------------------------------------------------
# Timing helper
# ---------------------------------------------------------------------------
def _tick(label: str, t0: float) -> float:
    """Print the time elapsed since t0 and return the current time as the new t0."""
    elapsed = time.perf_counter() - t0
    print(f" {label:<30} {elapsed:6.2f} s")
    return time.perf_counter()
# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------
def main():
    t_total = time.perf_counter()
    print("Spouštím generování reportu...")
    print()
    # -- 1. MongoDB: connect + fetch + sort ----------------------------------
    t = time.perf_counter()
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    client.admin.command("ping")
    db = client[DB_NAME]
    score_docs = list(db["Clario.MayoScore"].find({}))
    diary_docs = list(db["Clario.MayoDiary"].find({}))
    ecoa_dcr_docs = list(db["Clario.eCOA_DCRs"].find({}))
    ecg_dcr_docs = list(db["Clario.ECG_DCRs"].find({}))
    client.close()
    score_docs.sort(key=_visit_sort_key)
    diary_docs.sort(key=lambda d: (
        d.get("subject", {}).get("id", ""),
        d["fields"].get("Report Date", ""),
    ))
    t = _tick(f"MongoDB (ping, fetch, sort → {len(score_docs)} + {len(diary_docs)} + {len(ecoa_dcr_docs)} + {len(ecg_dcr_docs)} záznamů)", t)
    # -- 2-4. Build the sheets -----------------------------------------------
    wb = Workbook()
    ws_score = wb.active
    ws_score.title = "MayoScore"
    build_mayo_score_sheet(ws_score, score_docs)
    t = _tick("List MayoScore (KLIKNI SEM, zebra, červené I-0, autofilter)", t)
    ws_diary = wb.create_sheet("MayoDiary")
    build_mayo_diary_sheet(ws_diary, diary_docs)
    t = _tick("List MayoDiary (zebra, formátování dat, autofilter)", t)
    ws_days = wb.create_sheet("EligibleDays")
    build_eligible_days_sheet(ws_days, score_docs, diary_docs)
    t = _tick("List EligibleDays (diary lookup, included/excluded flag, autofilter)", t)
    ws_ecoa = wb.create_sheet("eCOA_DCRs")
    build_ecoa_dcrs_sheet(ws_ecoa, ecoa_dcr_docs)
    t = _tick(f"List eCOA_DCRs ({len(ecoa_dcr_docs)} záznamů)", t)
    ws_ecg = wb.create_sheet("ECG_DCRs")
    build_ecg_dcrs_sheet(ws_ecg, ecg_dcr_docs)
    t = _tick(f"List ECG_DCRs ({len(ecg_dcr_docs)} záznamů)", t)
    # -- 5. Save the XLSX (temporary file, replaced by the .xlsm below) ------
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    today = datetime.now().strftime("%Y-%m-%d")
    base_stem = f"{today} 77242113UCO3001 Clario Reports"
    xlsm_path = _unique_path(OUTPUT_DIR, base_stem, ".xlsm")
    xlsx_path = xlsm_path.with_suffix(".xlsx")
    wb.save(str(xlsx_path))
    t = _tick("Uložení XLSX (openpyxl, dočasný soubor)", t)
    # -- 6. Inject the VBA ---------------------------------------------------
    inject_vba(xlsx_path, xlsm_path)
    xlsx_path.unlink(missing_ok=True)
    _tick("Injektování VBA (xlwings: open → AddFromString → SaveAs .xlsm)", t)
    # -- Summary -------------------------------------------------------------
    total = time.perf_counter() - t_total
    print()
    print(f" {'Celkem':<30} {total:6.2f} s")
    print()
    print(f"Uloženo: {xlsm_path}")
def inject_vba(xlsx_path: Path, xlsm_path: Path) -> None:
    vba_code = '''\
Private Sub Worksheet_SelectionChange(ByVal Target As Range)
If Target.Row < 2 Then Exit Sub
If Target.Rows.Count > 1 Then Exit Sub
If Target.Column <> 1 Then Exit Sub
Dim subjectId As String
Dim visit As String
subjectId = CStr(Me.Cells(Target.Row, 3).Value)
visit = CStr(Me.Cells(Target.Row, 4).Value)
If subjectId = "" Or visit = "" Then Exit Sub
Dim ws As Worksheet
On Error Resume Next
Set ws = ThisWorkbook.Sheets("EligibleDays")
On Error GoTo 0
If ws Is Nothing Then Exit Sub
Application.ScreenUpdating = False
ws.AutoFilterMode = False
ws.Range("A1").AutoFilter
ws.Range("A1").AutoFilter Field:=2, Criteria1:=subjectId
ws.Range("A1").AutoFilter Field:=3, Criteria1:=visit
ws.Activate
ws.Range("A2").Select
Application.ScreenUpdating = True
End Sub
'''
    app = xw.App(visible=False)
    try:
        wb = app.books.open(str(xlsx_path))
        # Find the VBComponent that belongs to the "MayoScore" sheet by its tab name
        vb_comp = None
        for comp in wb.api.VBProject.VBComponents:
            if comp.Type == 100:  # 100 = vbext_ct_Document (sheet or workbook module)
                try:
                    if comp.Properties("Name").Value == "MayoScore":
                        vb_comp = comp
                        break
                except Exception:
                    pass
        if vb_comp is None:
            # Fallback: the first sheet's module (Sheet1)
            vb_comp = wb.api.VBProject.VBComponents("Sheet1")
        vb_comp.CodeModule.AddFromString(vba_code)
        wb.api.SaveAs(str(xlsm_path), FileFormat=52)  # 52 = xlOpenXMLWorkbookMacroEnabled
        wb.close()
    finally:
        app.quit()
if __name__ == "__main__":
    main()
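
The `_unique_path` helper in this file keeps an earlier report from being overwritten by appending ` (2)`, ` (3)`, ... to the file stem. A minimal standalone sketch of that naming behavior follows; the names `unique_path` and `demo`, and the use of a temporary directory, are assumptions made for illustration and are not part of the commit.

```python
import tempfile
from pathlib import Path


def unique_path(directory: Path, stem: str, suffix: str) -> Path:
    """Return the first non-existing path: 'stem.ext', 'stem (2).ext', ..."""
    candidate = directory / f"{stem}{suffix}"
    n = 2
    while candidate.exists():
        candidate = directory / f"{stem} ({n}){suffix}"
        n += 1
    return candidate


def demo() -> list[str]:
    """Create three colliding files in a temp dir and collect the chosen names."""
    names = []
    with tempfile.TemporaryDirectory() as tmp:
        directory = Path(tmp)
        for _ in range(3):
            path = unique_path(directory, "report", ".xlsm")
            path.touch()  # occupy the name so the next call must pick another
            names.append(path.name)
    return names


if __name__ == "__main__":
    print(demo())
```

Note that the check and the later write are not atomic, so two processes generating a report at the same moment could still pick the same name; for a single-user script this is usually acceptable.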
@@ -0,0 +1,293 @@
# Name: janssenpc_file_send.py
# Version: 2.2
# Date: 2026-06-02
# Description: Renames the files in the ##JNJPrenos folder, uploads them to msgs.buzalka.cz
# and moves them to the Trash subfolder. Logs progress to file_send.log next to the script.
# Supported: PANORAMA Site Contacts (xlsx), Panorama Dashboard (xlsx),
# Site Visit Report (xlsx), Follow-Up Letter (xlsx),
# Clario MayoScore (csv), Clario MayoDiary (csv),
# Clario Data Corrections / DCRs (csv).
import os
import time
import shutil
import requests
import pandas as pd
from pathlib import Path
from datetime import datetime
TOKEN = "13e1bb01-9fd5-44a8-8ce9-4ee27133d340"
UPLOAD_URL = "https://msgs.buzalka.cz/upload-dropbox"
SOURCE_DIR = Path(r"C:\Users\vbuzalka\OneDrive - JNJ\##JNJPrenos")
TRASH_DIR = SOURCE_DIR / "Trash"
LOG_FILE = Path(__file__).parent / "file_send.log"
MAYO_DIARY_COLUMNS = [
'Protocol', 'Country', 'Site', 'PI Name', 'Subject ID',
'Report Date', 'Report Start Date/Time', 'Report End Date/Time',
'Stool Frequency', 'Form Number', 'Role', 'Original Source',
]
MAYO_SCORE_COLUMNS = [
'Protocol', 'Study Population', 'Country', 'Site', 'Principal Investigator',
'Participant ID', 'Baseline Stool Frequency', 'Visit', 'Visit Date',
'Endoscopy Completed?', 'Central Endoscopy Score', 'Local Endoscopy Score',
'Partial Mayo Score', 'Full Mayo Score',
]
DCR_ECOA_COLUMNS = [
'Protocol', 'Data Correction ID', 'Description', 'Query History',
]
DCR_ECG_COLUMNS = [
'Protocol', 'Data Correction ID', 'Site ID', 'PI_NAME', 'Subject Number', 'Query History',
]
PANORAMA_COLUMNS = [
'Part', 'Source', 'Sector', 'TA', 'Protocol ID', 'Interventional',
'Region', 'Country Name', 'Institution Name', 'Site City',
'Site Zip/Postal Code', 'Site Address', 'MSID', 'Site ID',
'Site Status', 'SM Full Name', 'PI Name', 'St F Subj Enr Act',
'ID', 'Category', 'Type', 'Priority', 'Severity', 'Description',
'Brief Description - Subject ID', 'Comments', 'Created By',
'Create Date', 'Last Modified Date', 'Start Date', 'Due Date',
'End Date', 'Status', 'Days Outstanding', 'Action Taken',
'Escalated To', 'Visit Report Status', 'Visit Report Approved',
'Visit Report Type', 'Visit Report Status End Date', 'Active',
'Association', 'Deviation', 'Deviation Closed Date', 'Reason For Exclusion'
]
def log(msg: str):
    """Print a timestamped line and append it to LOG_FILE."""
    ts = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    line = f"[{ts}] {msg}"
    print(line)
    with LOG_FILE.open("a", encoding="utf-8") as lf:
        lf.write(line + "\n")
def move_to_trash(f: Path):
    """Move f into TRASH_DIR; add a timestamp to the name if it is already taken."""
    TRASH_DIR.mkdir(exist_ok=True)
    dest = TRASH_DIR / f.name
    if dest.exists():
        ts = datetime.now().strftime('%Y%m%d_%H%M%S')
        dest = TRASH_DIR / f"{f.stem}_{ts}{f.suffix}"
    shutil.move(str(f), dest)
def get_timestamp(file_path: str) -> str:
    """Format the file's modification time for use as a file-name prefix."""
    return datetime.fromtimestamp(os.path.getmtime(file_path)).strftime('%Y-%m-%d_%H-%M-%S')
def prejmenuj(directory: Path) -> None:
log(f"--- Přejmenování, adresář: {directory} ---")
files = [f for f in directory.iterdir() if f.is_file()]
log(f" Nalezeno souborů: {len(files)}: {[f.name for f in files]}")
for f in files:
filename = f.name
file_path = str(f)
# 0a. CLARIO MAYO DIARY (CSV)
if 'MAYO-DIARY' in filename and filename.endswith('.csv'):
log(f" Detekován MayoDiary: {filename}")
try:
df = pd.read_csv(file_path)
missing = set(MAYO_DIARY_COLUMNS) - set(df.columns)
if not missing:
protocols = df['Protocol'].dropna().unique()
log(f" Protocol: {list(protocols)}")
if len(protocols) > 0:
study = str(protocols[0]).strip()
new_name = f"{get_timestamp(file_path)} {study} Clario MayoDiary.csv"
f.rename(directory / new_name)
log(f" ÚSPĚCH: -> '{new_name}'")
else:
log(f" VAROVÁNÍ: Sloupec Protocol je prázdný.")
else:
log(f" PŘESKOČENO: Chybí sloupce: {missing}")
except Exception as e:
log(f" CHYBA: {e}")
continue
# 0b. CLARIO MAYO SCORE (CSV)
if 'Custom.MayoScoreReport' in filename and filename.endswith('.csv'):
log(f" Detekován MayoScore: {filename}")
try:
df = pd.read_csv(file_path)
missing = set(MAYO_SCORE_COLUMNS) - set(df.columns)
if not missing:
protocols = df['Protocol'].dropna().unique()
log(f" Protocol: {list(protocols)}")
if len(protocols) > 0:
study = str(protocols[0]).strip()
new_name = f"{get_timestamp(file_path)} {study} Clario MayoScore.csv"
f.rename(directory / new_name)
log(f" ÚSPĚCH: -> '{new_name}'")
else:
log(f" VAROVÁNÍ: Sloupec Protocol je prázdný.")
else:
log(f" PŘESKOČENO: Chybí sloupce: {missing}")
except Exception as e:
log(f" CHYBA: {e}")
continue
# 0c. CLARIO DATA CORRECTIONS (CSV): ECG or eCOA
if filename.endswith('.csv'):
try:
df = pd.read_csv(file_path, nrows=2)
cols = set(df.columns)
log(f" CSV sloupce ({filename}): {sorted(cols)}")
missing_ecg = set(DCR_ECG_COLUMNS) - cols
missing_ecoa = set(DCR_ECOA_COLUMNS) - cols
log(f" Chybí pro ECG: {missing_ecg or ''}")
log(f" Chybí pro eCOA: {missing_ecoa or ''}")
if not missing_ecg:
label = "Clario ECG DCRs"
elif not missing_ecoa:
label = "Clario eCOA DCRs"
else:
log(f" Neznámý CSV typ — bude odeslán bez přejmenování: {filename}")
# no rename here; the file is still picked up by the upload step
label = None
if label:
log(f" Detekován {label}: {filename}")
protocols = df['Protocol'].dropna().unique()
log(f" Protocol: {list(protocols)}")
if len(protocols) > 0:
study = str(protocols[0]).strip()
new_name = f"{get_timestamp(file_path)} {study} {label}.csv"
f.rename(directory / new_name)
log(f" ÚSPĚCH přejmenování: -> '{new_name}'")
else:
log(f" VAROVÁNÍ: Sloupec Protocol je prázdný — odesílám pod původním názvem.")
except Exception as e:
log(f" CHYBA při zpracování CSV {filename}: {e}")
continue
# Everything else: xlsx only
if not filename.endswith('.xlsx'):
log(f" Přeskočeno (neznámý typ): {filename}")
continue
# 1a. PANORAMA SITE CONTACTS (XLSX): file named "PANORAMA Dashboard"
if 'PANORAMA Dashboard' in filename:
log(f" Detekován PANORAMA Site Contacts: {filename}")
try:
with pd.ExcelFile(file_path) as xl:
sheet_names = xl.sheet_names
if 'Site Contacts' in sheet_names:
df_a1 = xl.parse('Site Contacts', nrows=1, header=None)
a1 = str(df_a1.iloc[0, 0]) if not df_a1.empty else ''
else:
a1 = None
# the file is closed at this point, so the rename will not fail on an open handle
if a1 is None:
log(f" PŘESKOČENO: List 'Site Contacts' nenalezen.")
elif 'Title: Site Contacts' in a1:
new_name = f"{get_timestamp(file_path)} PANORAMA Site Contacts.xlsx"
f.rename(directory / new_name)
log(f" ÚSPĚCH: -> '{new_name}'")
else:
log(f" PŘESKOČENO: A1 neodpovídá vzoru ({a1[:50]})")
except Exception as e:
log(f" CHYBA: {e}")
continue
# 1. PANORAMA DASHBOARD (XLSX)
if 'Panorama Dashboard' in filename:
log(f" Detekován Panorama: {filename}")
try:
df = pd.read_excel(file_path, skiprows=5)
missing = set(PANORAMA_COLUMNS) - set(df.columns)
if not missing:
ids = df['Protocol ID'].dropna().unique()
log(f" Protocol ID: {list(ids)}")
if len(ids) > 0:
study = str(ids[0]).strip()
new_name = f"{get_timestamp(file_path)} {study} Panorama Deviations and Issues.xlsx"
f.rename(directory / new_name)
log(f" ÚSPĚCH: -> '{new_name}'")
else:
log(f" VAROVÁNÍ: Protocol ID je prázdný.")
else:
log(f" PŘESKOČENO: Chybí sloupce: {missing}")
except Exception as e:
log(f" CHYBA: {e}")
continue
# 2. SITE VISIT REPORT AND FOLLOW-UP LETTER (XLSX)
try:
df_a1 = pd.read_excel(file_path, nrows=1, header=None)
if not df_a1.empty:
a1 = str(df_a1.iloc[0, 0])
log(f" A1: {a1[:80]}")
is_site_visit = "Title: Site Visit Report Details" in a1
is_follow_up = "Title: Follow-Up Letter Details" in a1
if is_site_visit or is_follow_up:
suffix = "Site Visit Details.xlsx" if is_site_visit else "FUL details.xlsx"
log(f" Detekován {'Site Visit' if is_site_visit else 'Follow-Up Letter'}: {filename}")
df = pd.read_excel(file_path, skiprows=5)
if 'Protocol ID' in df.columns:
ids = df['Protocol ID'].dropna().unique()
log(f" Protocol ID: {list(ids)}")
if len(ids) > 0:
study = str(ids[0]).strip()
new_name = f"{get_timestamp(file_path)} {study} {suffix}"
f.rename(directory / new_name)
log(f" ÚSPĚCH: -> '{new_name}'")
else:
log(f" VAROVÁNÍ: Protocol ID je prázdný.")
else:
log(f" PŘESKOČENO: Chybí sloupec Protocol ID.")
else:
log(f" Přeskočeno (neznámý xlsx obsah): {filename}")
except Exception as e:
log(f" CHYBA: {e}")
log("--- Přejmenování dokončeno ---")
# === MAIN LOGIC ===
log("=== Spuštění ===")
log(f"Zdrojový adresář: {SOURCE_DIR} (existuje: {SOURCE_DIR.exists()})")
# 1. Rename
prejmenuj(SOURCE_DIR)
# 2. Wait 10 seconds
log("Čekám 10 vteřin...")
time.sleep(10)
# 3. Upload the files
files = [f for f in SOURCE_DIR.iterdir() if f.is_file()]
log(f"Souborů k odeslání: {len(files)}")
for f in files:
    log(f" Nalezen: {f.name}")
if not files:
    log("Žádné soubory k odeslání.")
else:
    for f in files:
        try:
            with f.open("rb") as fh:
                resp = requests.post(
                    UPLOAD_URL,
                    headers={"Authorization": f"Bearer {TOKEN}"},
                    files={"file": (f.name, fh, "application/octet-stream")},
                    timeout=120,
                )
            # The handle is closed here, so the file can be moved afterwards
            resp.raise_for_status()
            status = resp.json().get('status', '?').upper()
            log(f" {status:10} | {f.name}")
            move_to_trash(f)
            log(f" PŘESUNUTO | {f.name} -> Trash")
        except Exception as e:
            log(f" CHYBA | {f.name} | {e}")
log("=== Hotovo ===")
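
For Clario data-correction exports, the renaming step above decides the file type from the CSV header alone: if every ECG column is present the file is labelled "Clario ECG DCRs", otherwise it is checked against the eCOA columns. The sketch below isolates that decision. It reads the header with the standard `csv` module instead of pandas, and the function name `classify_dcr_csv` is hypothetical; only the two column lists and the labels come from the script.

```python
import csv
import io
from typing import Optional

# Required columns, copied from DCR_ECOA_COLUMNS / DCR_ECG_COLUMNS in the script.
DCR_ECOA_COLUMNS = {"Protocol", "Data Correction ID", "Description", "Query History"}
DCR_ECG_COLUMNS = {"Protocol", "Data Correction ID", "Site ID", "PI_NAME",
                   "Subject Number", "Query History"}


def classify_dcr_csv(text: str) -> Optional[str]:
    """Return the report label for a CSV, or None if the header matches neither type."""
    reader = csv.reader(io.StringIO(text))
    header = set(next(reader, []))  # an empty file yields an empty header
    # ECG is tested first, mirroring the order used in the script.
    if DCR_ECG_COLUMNS <= header:
        return "Clario ECG DCRs"
    if DCR_ECOA_COLUMNS <= header:
        return "Clario eCOA DCRs"
    return None


if __name__ == "__main__":
    print(classify_dcr_csv("Protocol,Data Correction ID,Description,Query History\n"))
```

Testing ECG first matters only if an export ever contains both column sets; in that case it would be labelled as ECG.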
@@ -0,0 +1,10 @@
{
"pk": 3237,
"title": "Subject_Number_Creation",
"label": "Janssen 77242113UCO3001 Subject CZ100132003 has been created in IRT at site DD5-CZ10013",
"event": "Create",
"actual_date": "2026-05-06",
"subject": "CZ100132003",
"study": "77242113UCO3001",
"text": "77242113UCO3001\n\nJanssen Pharmaceuticals\nhttps://janssen.4gclinical.com\n\nSubject CZ100132003 has been created in IRT.\n\nSite Details\n\nLocation: CZE\n\nSite: DD5-CZ10013\n\nInvestigator: David Stepek\n\nSubject Details\n\nSubject: CZ100132003\n\nIRT Subject Status: Screened\n\nRescreened Subject: No\n\nCohort: Adult subjects (18 years or older)\n\nInformed Consent Date at Subject Creation: 06-May-2026\n\n ADT-IR: No\n\n 3 or More Advanced Therapies: No\n\n Ustekinumab: No\n\n Only Oral 5-ASA Compounds: No\n\nDate of Subject Creation in IRT: 06-May-2026\n\nTransaction Date/Time (site local): 06-May-2026 10:33:13\n\nTransaction Date/Time (system local): 06-May-2026 08:33:13\n\nTransaction performed by: dstepek@vnbrno.cz\n\nIf you have questions about this notification, please contact 4G Clinical Support at http://support.4gclinical.com"
}
@@ -0,0 +1,10 @@
{
"pk": 3510,
"title": "Subject_Number_Creation",
"label": "Janssen 77242113UCO3001 Subject CZ100032001 has been created in IRT at site DD5-CZ10003",
"event": "Create",
"actual_date": "2026-05-13",
"subject": "CZ100032001",
"study": "77242113UCO3001",
"text": "77242113UCO3001\n\nJanssen Pharmaceuticals\nhttps://janssen.4gclinical.com\n\nSubject CZ100032001 has been created in IRT.\n\nSite Details\n\nLocation: CZE\n\nSite: DD5-CZ10003\n\nInvestigator: Leksa Vaclav\n\nSubject Details\n\nSubject: CZ100032001\n\nIRT Subject Status: Screened\n\nRescreened Subject: No\n\nCohort: Adult subjects (18 years or older)\n\nInformed Consent Date at Subject Creation: 13-May-2026\n\n ADT-IR: No\n\n 3 or More Advanced Therapies: No\n\n Ustekinumab: No\n\n Only Oral 5-ASA Compounds: No\n\nDate of Subject Creation in IRT: 13-May-2026\n\nTransaction Date/Time (site local): 13-May-2026 07:44:11\n\nTransaction Date/Time (system local): 13-May-2026 05:44:11\n\nTransaction performed by: vaclav.leksa@seznam.cz\n\nIf you have questions about this notification, please contact 4G Clinical Support at http://support.4gclinical.com"
}
@@ -0,0 +1,10 @@
{
"pk": 4231,
"title": "Subject_Number_Creation",
"label": "Janssen 77242113UCO3001 Subject CZ100162002 has been created in IRT at site DD5-CZ10016",
"event": "Create",
"actual_date": "2026-05-27",
"subject": "CZ100162002",
"study": "77242113UCO3001",
"text": "77242113UCO3001\n\nJanssen Pharmaceuticals\nhttps://janssen.4gclinical.com\n\nSubject CZ100162002 has been created in IRT.\n\nSite Details\n\nLocation: CZE\n\nSite: DD5-CZ10016\n\nInvestigator: Robert Mudr\n\nSubject Details\n\nSubject: CZ100162002\n\nIRT Subject Status: Screened\n\nRescreened Subject: No\n\nCohort: Adult subjects (18 years or older)\n\nInformed Consent Date at Subject Creation: 27-May-2026\n\n ADT-IR: Yes\n\n 3 or More Advanced Therapies: No\n\n Ustekinumab: No\n\n Only Oral 5-ASA Compounds: No\n\nDate of Subject Creation in IRT: 27-May-2026\n\nTransaction Date/Time (site local): 27-May-2026 11:55:28\n\nTransaction Date/Time (system local): 27-May-2026 09:55:28\n\nTransaction performed by: petr.pekny@nmskb.cz\n\nIf you have questions about this notification, please contact 4G Clinical Support at http://support.4gclinical.com"
}
@@ -0,0 +1,10 @@
{
"pk": 4271,
"title": "Subject_Number_Creation",
"label": "Janssen 77242113UCO3001 Subject CZ100012004 has been created in IRT at site DD5-CZ10001",
"event": "Create",
"actual_date": "2026-05-28",
"subject": "CZ100012004",
"study": "77242113UCO3001",
"text": "77242113UCO3001\n\nJanssen Pharmaceuticals\nhttps://janssen.4gclinical.com\n\nSubject CZ100012004 has been created in IRT.\n\nSite Details\n\nLocation: CZE\n\nSite: DD5-CZ10001\n\nInvestigator: Matej Falc\n\nSubject Details\n\nSubject: CZ100012004\n\nIRT Subject Status: Screened\n\nRescreened Subject: No\n\nCohort: Adult subjects (18 years or older)\n\nInformed Consent Date at Subject Creation: 28-May-2026\n\n ADT-IR: No\n\n 3 or More Advanced Therapies: No\n\n Ustekinumab: No\n\n Only Oral 5-ASA Compounds: No\n\nDate of Subject Creation in IRT: 28-May-2026\n\nTransaction Date/Time (site local): 28-May-2026 07:14:21\n\nTransaction Date/Time (system local): 28-May-2026 05:14:21\n\nTransaction performed by: matesfalc@seznam.cz\n\nIf you have questions about this notification, please contact 4G Clinical Support at http://support.4gclinical.com"
}
@@ -0,0 +1,10 @@
{
"pk": 4461,
"title": "Randomized",
"label": "Janssen 77242113UCO3001 Subject randomized CZ100132003 at site DD5-CZ10013",
"event": "I0",
"actual_date": "2026-06-02",
"subject": "CZ100132003",
"study": "77242113UCO3001",
"text": "77242113UCO3001\n\nJanssen Pharmaceuticals\nhttps://janssen.4gclinical.com\n\nSubject CZ100132003 has been randomized.\n\n The following medication(s) has been assigned to the subject:\n\n \n \n Medication No\n Medication Type\n Packaged Lot No\n Expiration Date\n \n \n \n 1056513\n Icotrokinra 320mg / placebo\n 4393030\n 19-Jan-2027\n \n \n \n\nSite Details\n\nLocation: CZE\n\nSite: DD5-CZ10013\n\nInvestigator: David Stepek\n\nSubject Details\n\nSubject: CZ100132003\n\nIRT Subject Status: Randomized\n\nCohort: Adult subjects (18 years or older)\n\n ADT-IR: No\n\n 3 or More Advanced Therapies: No\n\n Ustekinumab: No\n\n Only Oral 5-ASA Compounds: No\n \n Isolated Proctitis: No\n\nTransaction Date/Time (site local): 02-Jun-2026 08:19:11\n\nTransaction Date/Time (system local): 02-Jun-2026 06:19:11\n\nTransaction performed by: dstepek@vnbrno.cz\n\nIf you have questions about this notification, please contact 4G Clinical Support at http://support.4gclinical.com"
}
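
The `text` field of these IRT notification fixtures is a sequence of `Key: Value` lines separated by blank lines. A minimal sketch of turning it into a dictionary is shown below; the function name `parse_notification_text` is hypothetical, and the sample is a shortened excerpt of the first fixture above, not a complete message.

```python
def parse_notification_text(text: str) -> dict[str, str]:
    """Collect 'Key: Value' lines; headings and free-text lines are ignored."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Split on the first ": " so that times such as 10:33:13 stay in the value
        # and URLs (which contain "://" but no ": ") are not mistaken for fields.
        key, sep, value = line.partition(": ")
        if sep and key and value:
            fields[key] = value.strip()
    return fields


SAMPLE_TEXT = (
    "Subject CZ100132003 has been created in IRT.\n\n"
    "Site Details\n\n"
    "Location: CZE\n\n"
    "Site: DD5-CZ10013\n\n"
    "IRT Subject Status: Screened\n\n"
    " ADT-IR: No\n\n"
    "Transaction Date/Time (site local): 06-May-2026 10:33:13\n\n"
    "If you have questions about this notification, please contact "
    "4G Clinical Support at http://support.4gclinical.com"
)

if __name__ == "__main__":
    for key, value in parse_notification_text(SAMPLE_TEXT).items():
        print(f"{key} = {value}")
```

The medication table in the "Randomized" notification is laid out as bare lines without keys, so this sketch would skip it; extracting it would need separate handling.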
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
@@ -0,0 +1,449 @@
"""
download_attachments_v1.0.py

Name:    download_attachments_v1.0.py
Version: 1.0
Date:    2026-06-02
Author:  vladimir.buzalka

Description:
    Downloads the real attachments (is_inline=False) of all emails in the MongoDB
    collection ordinace@buzalkova.cz directly through the Microsoft Graph API and
    stores them in the directory /mnt/Emails/ordinace@buzalkova.cz/Attachments/.

    Deduplication by the SHA256 hash of the content:
      - same hash = the file already exists -> skipped
      - first occurrence of a file: saved under its original name
      - name collision (same name, different hash): faktura_2.pdf, faktura_3.pdf ...

    After saving, MongoDB is updated:
      - in the email document: every attachment gets file_hash + local_path
      - collection emaily.attachments_index: _id=hash, filename, local_path, size_bytes,
        mime_type, first_seen_at, ref_count (number of emails that contain it)

    Safe to interrupt and re-run:
      - messages whose attachments are all downloaded already (have file_hash) are skipped
      - --force-recheck re-verifies the downloaded ones as well (in case of changes on disk)

    NOTE: The script only READS from the mailbox; it never writes to it!

Usage:
    python download_attachments_v1.0.py                  # download everything that is missing
    python download_attachments_v1.0.py --limit 50       # test on the first 50 emails
    python download_attachments_v1.0.py --force-recheck  # re-verify downloaded ones too

Docker (after adding the mount /mnt/user/Emails -> /mnt/Emails):
    docker exec -it python-runner python /scripts/download_attachments_v1.0.py

Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+

Layout on disk:
    /mnt/Emails/
    └── ordinace@buzalkova.cz/
        └── Attachments/
            ├── faktura_2026.pdf
            ├── vysledky_lab.pdf
            ├── vysledky_lab_2.pdf   <- name collision, different content
            └── ...

Collection emaily.attachments_index:
    _id            SHA256 hash (hex)
    filename       file name on disk (from the first occurrence)
    local_path     path relative to Attachments/ (currently = filename)
    size_bytes     file size
    mime_type      MIME type
    first_seen_at  datetime UTC
    ref_count      number of emails in which this attachment occurs

Updates in the email document (collection ordinace@buzalkova.cz):
    attachments[i].file_hash   SHA256 hash
    attachments[i].local_path  path relative to Attachments/

Version history:
    1.0 2026-06-02 Initial version
"""
import sys
import hashlib
import logging
import argparse
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from pymongo import MongoClient, UpdateOne
if hasattr(sys.stdout, "reconfigure"):
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL_EMAILS = "ordinace@buzalkova.cz"
MONGO_COL_INDEX = "attachments_index"
ATTACHMENTS_DIR = Path("/mnt/Emails/ordinace@buzalkova.cz/Attachments")
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.0"
BATCH_SIZE = 50
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
_graph_token: Optional[str] = None
# ─── Graph API ────────────────────────────────────────────────────────────────
def get_token() -> str:
    """Acquire an app-only token (client credentials flow) and cache it globally."""
    global _graph_token
    app = msal.ConfidentialClientApplication(
        GRAPH_CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
        client_credential=GRAPH_CLIENT_SECRET,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(f"Graph auth failed: {result}")
    _graph_token = result["access_token"]
    return _graph_token

def graph_get_bytes(url: str) -> bytes:
    """Download the binary content of an attachment."""
    global _graph_token
    if not _graph_token:
        get_token()
    # Two attempts: on a 401 the token is refreshed once and the request repeated
    for attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, timeout=120, stream=True)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.content
    raise RuntimeError(f"Graph GET bytes failed: {url}")

def graph_get_json(url: str, params: Optional[dict] = None) -> dict:
    """GET a Graph endpoint and return the decoded JSON body."""
    global _graph_token
    if not _graph_token:
        get_token()
    # Two attempts: on a 401 the token is refreshed once and the request repeated
    for attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET json failed: {url}")

def fetch_attachment_content(graph_message_id: str, attachment_id: str) -> Optional[bytes]:
    """Download the attachment content through the Graph API; return None on failure."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{graph_message_id}/attachments/{attachment_id}/$value"
    try:
        return graph_get_bytes(url)
    except Exception as e:
        logging.error("fetch_attachment_content failed [msg=%s att=%s]: %s", graph_message_id, attachment_id, e)
        return None

def fetch_message_attachments(graph_message_id: str) -> list[dict]:
    """Load the message's attachment list from the Graph API (metadata including the attachment ID)."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{graph_message_id}/attachments"
    try:
        data = graph_get_json(url, {"$select": "id,name,contentType,size,isInline,contentId"})
        return data.get("value", [])
    except Exception as e:
        logging.error("fetch_message_attachments failed [%s]: %s", graph_message_id, e)
        return []
# ─── Dedup + saving ───────────────────────────────────────────────────────────
def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def resolve_filename(desired_name: str, att_dir: Path, hash_val: str, index_col) -> str:
    """
    Return the file name to use for saving.
    If desired_name already exists with a different hash, append a suffix _2, _3 ...
    """
    # Check whether an indexed file with the same name has the same hash
    existing = index_col.find_one({"filename": desired_name})
    if existing:
        if existing["_id"] == hash_val:
            return desired_name  # Same hash, same name: dedup hit
        # Different hash: look for a free suffix
        stem = Path(desired_name).stem
        suffix = Path(desired_name).suffix
        n = 2
        while True:
            candidate = f"{stem}_{n}{suffix}"
            if not (att_dir / candidate).exists():
                # Make sure the index does not hold this candidate with another hash either
                ex2 = index_col.find_one({"filename": candidate})
                if not ex2 or ex2["_id"] == hash_val:
                    return candidate
            n += 1
    return desired_name

def save_attachment(content: bytes, original_name: str, att_dir: Path, index_col) -> tuple[str, str, bool]:
    """
    Save an attachment with deduplication.
    Returns (hash, local_path, was_new):
        was_new=True  -> the file was saved
        was_new=False -> the hash already existed, the file was skipped
    """
    hash_val = sha256(content)
    # Check the index: if the hash already exists, return the existing record
    existing = index_col.find_one({"_id": hash_val})
    if existing:
        # Increase the reference counter
        index_col.update_one({"_id": hash_val}, {"$inc": {"ref_count": 1}})
        return hash_val, existing["local_path"], False
    # New file: determine its name
    safe_name = "".join(c if c.isalnum() or c in "._- " else "_" for c in original_name).strip()
    if not safe_name:
        safe_name = f"attachment_{hash_val[:8]}"
    filename = resolve_filename(safe_name, att_dir, hash_val, index_col)
    file_path = att_dir / filename
    # Write the file
    file_path.write_bytes(content)
    # Record it in the index
    index_col.insert_one({
        "_id": hash_val,
        "filename": filename,
        "local_path": filename,
        "size_bytes": len(content),
        "mime_type": "",
        "first_seen_at": datetime.now(timezone.utc).replace(tzinfo=None),
        "ref_count": 1,
    })
    return hash_val, filename, True
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"download_attachments v{SCRIPT_VERSION}")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N emailu (0 = vse)")
ap.add_argument("--force-recheck", action="store_true",
help="Znovu overi i emaily kde prilohy uz maji file_hash")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na konci")
args = ap.parse_args()
start = datetime.now()
print(f"=== download_attachments v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {GRAPH_MAILBOX}")
print(f"Cilovy adresar: {ATTACHMENTS_DIR}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}")
# Target directory
ATTACHMENTS_DIR.mkdir(parents=True, exist_ok=True)
print(f" Adresar OK")
# Graph
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
# MongoDB
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col_emails = client[MONGO_DB][MONGO_COL_EMAILS]
col_index = client[MONGO_DB][MONGO_COL_INDEX]
# Indexes on the attachment index collection
if not args.no_indexes:
col_index.create_index("filename")
col_index.create_index("mime_type")
# Query: emails with attachments that have not been processed yet
if args.force_recheck:
query = {"has_attachments": True}
else:
query = {
"has_attachments": True,
"attachments": {
"$elemMatch": {
"is_inline": False,
"file_hash": {"$exists": False},
}
}
}
total = col_emails.count_documents(query)
print(f"\nEmailu ke zpracovani: {total}")
if total == 0:
print("Neni co stahnout.")
client.close()
return
cursor = col_emails.find(query, {"_id": 1, "graph_id": 1, "subject": 1, "attachments": 1})
if args.limit:
cursor = cursor.limit(args.limit)
ok_count = 0
new_count = 0
skip_count = 0
err_count = 0
email_i = 0
batch = []
def flush():
if not batch:
return
try:
col_emails.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
for email_doc in cursor:
email_i += 1
email_id = email_doc["_id"]
graph_id = email_doc.get("graph_id", "")
subject = (email_doc.get("subject") or "")[:60]
att_list = email_doc.get("attachments") or []
# Real (non-inline) attachments only
real_atts = [a for a in att_list if not a.get("is_inline", False)]
if not real_atts:
continue
print(f"\n {email_i:>5}/{total} {subject}")
# Load the attachment IDs from the Graph API
graph_atts = fetch_message_attachments(graph_id)
graph_att_map = {a["name"]: a for a in graph_atts if not a.get("isInline", False)}
updated_atts = list(att_list)
email_ok = True
for i, att in enumerate(updated_atts):
if att.get("is_inline", False):
continue
if not args.force_recheck and att.get("file_hash"):
skip_count += 1
print(f" SKIP {att['filename']}")
continue
att_name = att.get("filename", "")
graph_att = graph_att_map.get(att_name)
if not graph_att:
# Fall back to a case-insensitive partial name match
for gname, ga in graph_att_map.items():
if att_name.lower() in gname.lower():
graph_att = ga
break
if not graph_att:
logging.error("attachment not found in Graph [email=%s att=%s]", email_id, att_name)
print(f" ERR {att_name} (nenalezeno v Graph)")
err_count += 1
email_ok = False
continue
# Download the content
content = fetch_attachment_content(graph_id, graph_att["id"])
if content is None:
err_count += 1
email_ok = False
print(f" ERR {att_name} (stazeni selhalo)")
continue
# Save with deduplication
hash_val, local_path, was_new = save_attachment(content, att_name, ATTACHMENTS_DIR, col_index)
# Update the MIME type in the index
col_index.update_one(
{"_id": hash_val},
{"$set": {"mime_type": att.get("mime_type", graph_att.get("contentType", ""))}},
)
# Record the result on the email's attachment entry
updated_atts[i] = {**att, "file_hash": hash_val, "local_path": local_path}
if was_new:
new_count += 1
print(f" NEW {local_path} ({len(content):,} B)")
else:
skip_count += 1
print(f" DUP {att_name} -> {local_path}")
if email_ok:
ok_count += 1
# Write the updated attachments back to the email document
batch.append(UpdateOne(
{"_id": email_id},
{"$set": {"attachments": updated_atts}}
))
if len(batch) >= BATCH_SIZE:
flush()
if email_i % 100 == 0:
print(f" {'─'*60}")
print(f" Průběh: emaily={email_i}/{total} nove={new_count} dup={skip_count} err={err_count}")
print(f" {'─'*60}")
flush()
elapsed_total = (datetime.now() - start).total_seconds()
files_total = col_index.count_documents({})
size_total = sum(d.get("size_bytes", 0) for d in col_index.find({}, {"size_bytes": 1}))
print(f"\n{'='*52}")
print(f"Vysledek: emaily={ok_count} | nove soubory={new_count} | duplikaty={skip_count} | err={err_count}")
print(f"Souboru v indexu: {files_total} ({size_total/1024/1024:.1f} MB)")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,428 @@
"""
download_attachments_v1.1.py
Name: download_attachments_v1.1.py
Version: 1.1
Date: 2026-06-02
Author: vladimir.buzalka
Description:
Downloads the real attachments (is_inline=False) of all emails stored in
MongoDB via the Microsoft Graph API and saves them to the directory
/mnt/Emails/<mailbox>/Attachments/.
The mailbox is passed as the required --mailbox parameter.
Deduplication by SHA256 hash of the content:
- same hash = file already exists -> skipped
- first occurrence of a file: saved under its original name
- name collision (same name, different hash): faktura_2.pdf, faktura_3.pdf ...
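The deduplication rules above can be sketched in isolation as follows. This is
only an illustration under stated assumptions: pick_name and the in-memory
index dict are hypothetical stand-ins for the script's MongoDB-backed
attachments_index, and nothing is written to disk.

```python
import hashlib
from pathlib import PurePath


def pick_name(desired: str, content: bytes, index: dict[str, str]) -> tuple[str, bool]:
    # index maps a SHA256 hex digest to the stored file name
    # (a hypothetical stand-in for the attachments_index collection).
    digest = hashlib.sha256(content).hexdigest()
    if digest in index:
        # Same content is already stored: reuse its name, store nothing new.
        return index[digest], False
    taken = set(index.values())
    stem, suffix = PurePath(desired).stem, PurePath(desired).suffix
    name, n = desired, 2
    while name in taken:
        # Same name but different content: try faktura_2.pdf, faktura_3.pdf, ...
        name = f"{stem}_{n}{suffix}"
        n += 1
    index[digest] = name
    return name, True


index: dict[str, str] = {}
first = pick_name("faktura.pdf", b"invoice A", index)
again = pick_name("copy.pdf", b"invoice A", index)
clash = pick_name("faktura.pdf", b"invoice B", index)
```

Here first is stored as new, again resolves to the already stored name because
the content is identical, and clash receives a numeric suffix.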
After saving, MongoDB is updated:
- in the email document: each attachment gets file_hash + local_path
- collection emaily.attachments_index: _id=hash, filename, local_path, size_bytes,
mime_type, mailbox, first_seen_at, ref_count
Safe to interrupt and re-run: emails in which every attachment already has a
file_hash are skipped. --force-recheck re-verifies attachments that were already downloaded.
NOTE: The script only READS from the mailbox; it never writes to it.
Usage:
python download_attachments_v1.1.py --mailbox ordinace@buzalkova.cz
python download_attachments_v1.1.py --mailbox vladimir.buzalka@buzalka.cz --limit 50
python download_attachments_v1.1.py --mailbox ordinace@buzalkova.cz --force-recheck
Docker:
docker exec -it python-runner python /scripts/download_attachments_v1.1.py \\
--mailbox ordinace@buzalkova.cz
Dependencies:
msal, requests, pymongo
Python 3.10+
Layout on disk:
/mnt/Emails/
└── <mailbox>/
└── Attachments/
├── faktura_2026.pdf
├── vysledky_lab.pdf
├── vysledky_lab_2.pdf
└── ...
Collection emaily.attachments_index:
_id            SHA256 hash (hex)
filename       file name on disk
local_path     path relative to Attachments/
size_bytes     file size in bytes
mime_type      MIME type
mailbox        mailbox in which the file was first seen
first_seen_at  datetime UTC
ref_count      number of emails in which this attachment occurs
Version history:
1.0 2026-06-02 Initial version
1.1 2026-06-02 Mailbox as the --mailbox parameter (general-purpose use)
"""
import os
import sys
import hashlib
import logging
import argparse
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from pymongo import MongoClient, UpdateOne
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
# The client secret is read from the environment so that it is not committed to the
# repository. The variable name is an assumption; a KeyError is raised if it is unset.
GRAPH_CLIENT_SECRET = os.environ["GRAPH_CLIENT_SECRET"]
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL_INDEX = "attachments_index"
EMAILS_BASE_DIR = Path("/mnt/Emails")
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.1"
BATCH_SIZE = 50
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
_graph_token: Optional[str] = None
# ─── Graph API ────────────────────────────────────────────────────────────────
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get_bytes(url: str) -> bytes:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, timeout=120, stream=True)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.content
raise RuntimeError(f"Graph GET bytes failed: {url}")
def graph_get_json(url: str, params: Optional[dict] = None) -> dict:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.json()
raise RuntimeError(f"Graph GET json failed: {url}")
def fetch_message_attachments(mailbox: str, graph_message_id: str) -> list[dict]:
url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments"
try:
data = graph_get_json(url, {"$select": "id,name,contentType,size,isInline,contentId"})
return data.get("value", [])
except Exception as e:
logging.error("fetch_message_attachments failed [%s]: %s", graph_message_id, e)
return []
def fetch_attachment_content(mailbox: str, graph_message_id: str, attachment_id: str) -> Optional[bytes]:
url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments/{attachment_id}/$value"
try:
return graph_get_bytes(url)
except Exception as e:
logging.error("fetch_attachment_content failed [msg=%s att=%s]: %s", graph_message_id, attachment_id, e)
return None
# ─── Dedup + storage ──────────────────────────────────────────────────────────
def sha256(data: bytes) -> str:
return hashlib.sha256(data).hexdigest()
def safe_filename(name: str) -> str:
safe = "".join(c if c.isalnum() or c in "._- " else "_" for c in name).strip()
return safe or "attachment"
def resolve_filename(desired_name: str, att_dir: Path, hash_val: str, col_index) -> str:
"""Return the file name to save under, resolving collisions (same name, different hash)."""
existing = col_index.find_one({"filename": desired_name})
if existing:
if existing["_id"] == hash_val:
return desired_name # Dedup hit: same hash
# Collision: look for a free numeric suffix
stem = Path(desired_name).stem
suffix = Path(desired_name).suffix
n = 2
while True:
candidate = f"{stem}_{n}{suffix}"
ex2 = col_index.find_one({"filename": candidate})
if not ex2 or ex2["_id"] == hash_val:
if not (att_dir / candidate).exists() or (ex2 and ex2["_id"] == hash_val):
return candidate
n += 1
return desired_name
def save_attachment(
content: bytes,
original_name: str,
mime_type: str,
mailbox: str,
att_dir: Path,
col_index,
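) -> tuple[str, str, bool]:
"""
Save an attachment with deduplication.
Returns (hash, local_path, was_new).
Note: the index is keyed by content hash across all mailboxes, so a duplicate
first seen in another mailbox resolves to a local_path under that mailbox's
Attachments directory and no file is written to att_dir.
"""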
hash_val = sha256(content)
existing = col_index.find_one({"_id": hash_val})
if existing:
col_index.update_one({"_id": hash_val}, {"$inc": {"ref_count": 1}})
return hash_val, existing["local_path"], False
filename = resolve_filename(safe_filename(original_name), att_dir, hash_val, col_index)
file_path = att_dir / filename
file_path.write_bytes(content)
col_index.insert_one({
"_id": hash_val,
"filename": filename,
"local_path": filename,
"size_bytes": len(content),
"mime_type": mime_type,
"mailbox": mailbox,
"first_seen_at": datetime.now(timezone.utc).replace(tzinfo=None),
"ref_count": 1,
})
return hash_val, filename, True
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"download_attachments v{SCRIPT_VERSION}")
ap.add_argument("--mailbox", required=True,
help="Emailova schranka (napr. ordinace@buzalkova.cz)")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N emailu (0 = vse)")
ap.add_argument("--force-recheck", action="store_true",
help="Znovu overi i emaily kde prilohy uz maji file_hash")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na attachments_index kolekci")
args = ap.parse_args()
mailbox = args.mailbox
att_dir = EMAILS_BASE_DIR / mailbox / "Attachments"
mongo_col = mailbox
start = datetime.now()
print(f"=== download_attachments v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {mailbox}")
print(f"Cilovy adresar: {att_dir}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{mongo_col}")
att_dir.mkdir(parents=True, exist_ok=True)
print(" Adresar OK")
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col_emails = client[MONGO_DB][mongo_col]
col_index = client[MONGO_DB][MONGO_COL_INDEX]
if not args.no_indexes:
col_index.create_index("filename")
col_index.create_index("mime_type")
col_index.create_index("mailbox")
# Query
if args.force_recheck:
query = {"has_attachments": True}
else:
query = {
"has_attachments": True,
"attachments": {
"$elemMatch": {
"is_inline": False,
"file_hash": {"$exists": False},
}
}
}
total = col_emails.count_documents(query)
print(f"\nEmailu ke zpracovani: {total}")
if total == 0:
print("Neni co stahnout.")
client.close()
return
cursor = col_emails.find(query, {"_id": 1, "graph_id": 1, "subject": 1, "attachments": 1})
if args.limit:
cursor = cursor.limit(args.limit)
ok_count = 0
new_count = 0
dup_count = 0
err_count = 0
email_i = 0
batch = []
def flush():
if not batch:
return
try:
col_emails.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
for email_doc in cursor:
email_i += 1
email_id = email_doc["_id"]
graph_id = email_doc.get("graph_id", "")
subject = (email_doc.get("subject") or "")[:60]
att_list = email_doc.get("attachments") or []
real_atts = [a for a in att_list if not a.get("is_inline", False)]
if not real_atts:
continue
print(f"\n {email_i:>5}/{total} {subject}")
graph_atts = fetch_message_attachments(mailbox, graph_id)
graph_att_map = {a["name"]: a for a in graph_atts if not a.get("isInline", False)}
updated_atts = list(att_list)
email_ok = True
for i, att in enumerate(updated_atts):
if att.get("is_inline", False):
continue
if not args.force_recheck and att.get("file_hash"):
print(f" SKIP {att.get('filename', '')}")
continue
att_name = att.get("filename", "")
graph_att = graph_att_map.get(att_name)
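if not graph_att:
# Fall back to a partial name match; an empty name would match everything, so it is excluded
for gname, ga in graph_att_map.items():
if att_name and att_name.lower() in gname.lower():
graph_att = ga
break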
if not graph_att:
logging.error("attachment not found in Graph [email=%s att=%s]", email_id, att_name)
print(f" ERR {att_name} (nenalezeno v Graph)")
err_count += 1
email_ok = False
continue
content = fetch_attachment_content(mailbox, graph_id, graph_att["id"])
if content is None:
err_count += 1
email_ok = False
print(f" ERR {att_name} (stazeni selhalo)")
continue
mime_type = att.get("mime_type") or graph_att.get("contentType", "")
hash_val, local_path, was_new = save_attachment(
content, att_name, mime_type, mailbox, att_dir, col_index
)
updated_atts[i] = {**att, "file_hash": hash_val, "local_path": local_path}
if was_new:
new_count += 1
print(f" NEW {local_path} ({len(content):,} B)")
else:
dup_count += 1
print(f" DUP {att_name} -> {local_path}")
if email_ok:
ok_count += 1
batch.append(UpdateOne({"_id": email_id}, {"$set": {"attachments": updated_atts}}))
if len(batch) >= BATCH_SIZE:
flush()
if email_i % 100 == 0:
print(f" {'─'*60}")
print(f" Průběh: emaily={email_i}/{total} nove={new_count} dup={dup_count} err={err_count}")
print(f" {'─'*60}")
flush()
elapsed_total = (datetime.now() - start).total_seconds()
files_total = col_index.count_documents({})
size_total = sum(d.get("size_bytes", 0) for d in col_index.find({}, {"size_bytes": 1}))
print(f"\n{'='*52}")
print(f"Vysledek: emaily={ok_count} | nove={new_count} | dup={dup_count} | err={err_count}")
print(f"Souboru v indexu: {files_total} ({size_total / 1024 / 1024:.1f} MB)")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,466 @@
"""
download_attachments_v1.2.py
Name: download_attachments_v1.2.py
Version: 1.2
Date: 2026-06-02
Author: vladimir.buzalka
Description:
Downloads the real attachments (is_inline=False) of all emails stored in
MongoDB via the Microsoft Graph API and saves them to the directory
/mnt/Emails/<mailbox>/Attachments/.
The mailbox is passed as the required --mailbox parameter.
Deduplication by SHA256 hash of the content:
- same hash = file already exists -> skipped
- first occurrence of a file: saved under its original name
- name collision (same name, different hash): faktura_2.pdf, faktura_3.pdf ...
After saving, MongoDB is updated:
- in the email document: each attachment gets file_hash + local_path
- collection emaily.attachments_index: _id=hash, filename, local_path, size_bytes,
mime_type, mailbox, first_seen_at, ref_count
Safe to interrupt and re-run: emails in which every attachment already has a
file_hash are skipped. --force-recheck re-verifies attachments that were already downloaded.
NOTE: The script only READS from the mailbox; it never writes to it.
Usage:
python download_attachments_v1.2.py --mailbox ordinace@buzalkova.cz
python download_attachments_v1.2.py --mailbox ordinace@buzalkova.cz --limit 50
python download_attachments_v1.2.py --mailbox ordinace@buzalkova.cz --force-recheck
Docker:
docker exec -it python-runner python /scripts/download_attachments_v1.2.py \\
--mailbox ordinace@buzalkova.cz
Dependencies:
msal, requests, pymongo
Python 3.10+
Version history:
1.0 2026-06-02 Initial version
1.1 2026-06-02 Mailbox as the --mailbox parameter
1.2 2026-06-02 Fix: the Graph attachment list now includes inline items (fixes ERR for
inline images stored as is_inline=False in MongoDB);
name normalization for robust matching; S/MIME files
(.p7m/.p7s) are skipped; if Graph marks an attachment as inline -> SKIP, not ERR
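The name normalization introduced in 1.2 can be illustrated with a small
standalone sketch. The helper normalize_for_match is hypothetical and uses a
character filter instead of the regular expression in the script, so treat it
as an approximation of the idea (lowercase, strip diacritics, replace other
characters with an underscore), not as the script's exact behavior.

```python
import unicodedata


def normalize_for_match(name: str) -> str:
    # Lowercase and decompose accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", name.lower().strip())
    # Drop the combining marks, which removes the diacritics.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Keep ASCII letters and digits plus ".", "_" and "-"; replace everything else.
    return "".join(c if (c.isascii() and c.isalnum()) or c in "._-" else "_" for c in base)


stored = "Výsledky Lab.PDF"
from_graph = "vysledky lab.pdf"
same = normalize_for_match(stored) == normalize_for_match(from_graph)
```

Two names that differ only in case, diacritics, or spacing compare equal after
normalization, which is what lets a MongoDB entry be matched to its Graph
counterpart.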
"""
import os
import sys
import re
import hashlib
import logging
import argparse
import unicodedata
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from pymongo import MongoClient, UpdateOne
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
# The client secret is read from the environment so that it is not committed to the
# repository. The variable name is an assumption; a KeyError is raised if it is unset.
GRAPH_CLIENT_SECRET = os.environ["GRAPH_CLIENT_SECRET"]
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL_INDEX = "attachments_index"
EMAILS_BASE_DIR = Path("/mnt/Emails")
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.2"
BATCH_SIZE = 50
# Attachment types to skip (S/MIME signatures, certificates)
SKIP_EXTENSIONS = {".p7m", ".p7s", ".p7c", ".p7b"}
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
_graph_token: Optional[str] = None
# ─── Graph API ────────────────────────────────────────────────────────────────
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get_bytes(url: str) -> bytes:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, timeout=120, stream=True)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.content
raise RuntimeError(f"Graph GET bytes failed: {url}")
def graph_get_json(url: str, params: Optional[dict] = None) -> dict:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.json()
raise RuntimeError(f"Graph GET json failed: {url}")
def fetch_message_attachments(mailbox: str, graph_message_id: str) -> list[dict]:
"""Fetch ALL attachments of a message (including inline ones); filtering happens later."""
url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments"
try:
data = graph_get_json(url, {"$select": "id,name,contentType,size,isInline,contentId"})
return data.get("value", [])
except Exception as e:
logging.error("fetch_message_attachments failed [%s]: %s", graph_message_id, e)
return []
def fetch_attachment_content(mailbox: str, graph_message_id: str, attachment_id: str) -> Optional[bytes]:
url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments/{attachment_id}/$value"
try:
return graph_get_bytes(url)
except Exception as e:
logging.error("fetch_attachment_content failed [msg=%s att=%s]: %s",
graph_message_id, attachment_id, e)
return None
# ─── Helper functions ─────────────────────────────────────────────────────────
def normalize_name(name: str) -> str:
    """Normalize a name for comparison: lowercase, no diacritics, and every character
    other than a word character, '.', or '-' replaced with '_'."""
    nfkd = unicodedata.normalize("NFKD", name.lower().strip())
    ascii_str = "".join(c for c in nfkd if not unicodedata.combining(c))
    return re.sub(r"[^\w.\-]", "_", ascii_str)
def find_graph_att(att_name: str, att_size: int, graph_atts: list[dict]) -> Optional[dict]:
    """
    Look up an attachment in the list returned by Graph.
    1. Exact name match
    2. Normalized name match with a size within 10 %
    3. Normalized name match regardless of size
    4. Suffix match on the last 20 characters of the normalized name
    """
    # 1. Exact match
    for ga in graph_atts:
        if ga.get("name") == att_name:
            return ga
    norm_want = normalize_name(att_name)
    # A missing or null name is treated as an empty string
    same_name = [ga for ga in graph_atts if normalize_name(ga.get("name") or "") == norm_want]
    # 2. Normalized match + size (±10 %). This check must run before the name-only
    #    fallback; previously it ran after it and could never be reached.
    for ga in same_name:
        ga_size = ga.get("size", 0)
        if att_size and ga_size and abs(ga_size - att_size) / max(ga_size, att_size) < 0.1:
            return ga
    # 3. Normalized match on the name alone
    if same_name:
        return same_name[0]
    # 4. Partial match on the suffix (last 20 characters of the normalized name)
    tail = norm_want[-20:]
    if tail:
        for ga in graph_atts:
            if normalize_name(ga.get("name") or "").endswith(tail):
                return ga
    return None
def sha256(data: bytes) -> str:
return hashlib.sha256(data).hexdigest()
def safe_filename(name: str) -> str:
safe = "".join(c if c.isalnum() or c in "._- ()" else "_" for c in name).strip()
return safe or "attachment"
def resolve_filename(desired_name: str, att_dir: Path, hash_val: str, col_index) -> str:
existing = col_index.find_one({"filename": desired_name})
if existing:
if existing["_id"] == hash_val:
return desired_name
stem = Path(desired_name).stem
suffix = Path(desired_name).suffix
n = 2
while True:
candidate = f"{stem}_{n}{suffix}"
ex2 = col_index.find_one({"filename": candidate})
if not ex2 or ex2["_id"] == hash_val:
if not (att_dir / candidate).exists() or (ex2 and ex2["_id"] == hash_val):
return candidate
n += 1
return desired_name
def save_attachment(
content: bytes,
original_name: str,
mime_type: str,
mailbox: str,
att_dir: Path,
col_index,
) -> tuple[str, str, bool]:
hash_val = sha256(content)
existing = col_index.find_one({"_id": hash_val})
if existing:
col_index.update_one({"_id": hash_val}, {"$inc": {"ref_count": 1}})
return hash_val, existing["local_path"], False
filename = resolve_filename(safe_filename(original_name), att_dir, hash_val, col_index)
file_path = att_dir / filename
file_path.write_bytes(content)
col_index.insert_one({
"_id": hash_val,
"filename": filename,
"local_path": filename,
"size_bytes": len(content),
"mime_type": mime_type,
"mailbox": mailbox,
"first_seen_at": datetime.now(timezone.utc).replace(tzinfo=None),
"ref_count": 1,
})
return hash_val, filename, True
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"download_attachments v{SCRIPT_VERSION}")
ap.add_argument("--mailbox", required=True,
help="Emailova schranka (napr. ordinace@buzalkova.cz)")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N emailu (0 = vse)")
ap.add_argument("--force-recheck", action="store_true",
help="Znovu overi i emaily kde prilohy uz maji file_hash")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na attachments_index kolekci")
args = ap.parse_args()
mailbox = args.mailbox
att_dir = EMAILS_BASE_DIR / mailbox / "Attachments"
mongo_col = mailbox
start = datetime.now()
print(f"=== download_attachments v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {mailbox}")
print(f"Cilovy adresar: {att_dir}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{mongo_col}")
att_dir.mkdir(parents=True, exist_ok=True)
print(" Adresar OK")
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col_emails = client[MONGO_DB][mongo_col]
col_index = client[MONGO_DB][MONGO_COL_INDEX]
if not args.no_indexes:
col_index.create_index("filename")
col_index.create_index("mime_type")
col_index.create_index("mailbox")
if args.force_recheck:
query = {"has_attachments": True}
else:
query = {
"has_attachments": True,
"attachments": {
"$elemMatch": {
"is_inline": False,
"file_hash": {"$exists": False},
}
}
}
total = col_emails.count_documents(query)
print(f"\nEmailu ke zpracovani: {total}")
if total == 0:
print("Neni co stahnout.")
client.close()
return
cursor = col_emails.find(query, {"_id": 1, "graph_id": 1, "subject": 1, "attachments": 1})
if args.limit:
cursor = cursor.limit(args.limit)
ok_count = 0
new_count = 0
dup_count = 0
skip_count = 0
err_count = 0
email_i = 0
batch = []
def flush():
if not batch:
return
try:
col_emails.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
for email_doc in cursor:
email_i += 1
email_id = email_doc["_id"]
graph_id = email_doc.get("graph_id", "")
subject = (email_doc.get("subject") or "")[:60]
att_list = email_doc.get("attachments") or []
real_atts = [a for a in att_list if not a.get("is_inline", False)]
if not real_atts:
continue
print(f"\n {email_i:>5}/{total} {subject}")
# Fetch ALL attachments from Graph (including inline ones; they are needed for matching)
graph_atts = fetch_message_attachments(mailbox, graph_id)
updated_atts = list(att_list)
email_ok = True
for i, att in enumerate(updated_atts):
if att.get("is_inline", False):
continue
if not args.force_recheck and att.get("file_hash"):
continue
att_name = att.get("filename", "")
att_size = att.get("size_bytes", 0)
# Skip S/MIME signatures
if Path(att_name).suffix.lower() in SKIP_EXTENSIONS:
updated_atts[i] = {**att, "file_hash": "skip", "local_path": ""}
skip_count += 1
print(f" SKIP {att_name} (S/MIME)")
continue
# Find the attachment in Graph
graph_att = find_graph_att(att_name, att_size, graph_atts)
if not graph_att:
logging.error("attachment not found [email=%s att=%s]", email_id, att_name)
print(f" ERR {att_name} (nenalezeno)")
err_count += 1
email_ok = False
continue
# If Graph reports the attachment as inline, skip it without downloading
if graph_att.get("isInline", False):
updated_atts[i] = {**att, "is_inline": True, "file_hash": "skip", "local_path": ""}
skip_count += 1
print(f" SKIP {att_name} (inline obrazek)")
continue
content = fetch_attachment_content(mailbox, graph_id, graph_att["id"])
if content is None:
err_count += 1
email_ok = False
print(f" ERR {att_name} (stazeni selhalo)")
continue
mime_type = att.get("mime_type") or graph_att.get("contentType", "")
hash_val, local_path, was_new = save_attachment(
content, att_name, mime_type, mailbox, att_dir, col_index
)
updated_atts[i] = {**att, "file_hash": hash_val, "local_path": local_path}
if was_new:
new_count += 1
print(f" NEW {local_path} ({len(content):,} B)")
else:
dup_count += 1
print(f" DUP {att_name} -> {local_path}")
if email_ok:
ok_count += 1
batch.append(UpdateOne({"_id": email_id}, {"$set": {"attachments": updated_atts}}))
if len(batch) >= BATCH_SIZE:
flush()
if email_i % 100 == 0:
print(f" {'─'*60}")
print(f" Průběh: emaily={email_i}/{total} nove={new_count} dup={dup_count} skip={skip_count} err={err_count}")
print(f" {'─'*60}")
flush()
elapsed_total = (datetime.now() - start).total_seconds()
files_total = col_index.count_documents({})
size_total = sum(d.get("size_bytes", 0) for d in col_index.find({}, {"size_bytes": 1}))
print(f"\n{'='*52}")
print(f"Vysledek: emaily={ok_count} | nove={new_count} | dup={dup_count} | skip={skip_count} | err={err_count}")
print(f"Souboru v indexu: {files_total} ({size_total / 1024 / 1024:.1f} MB)")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,560 @@
"""
parse_emails_graph_v1.0.py
Name: parse_emails_graph_v1.0.py
Version: 1.0
Date: 2026-06-02
Author: vladimir.buzalka
Description:
Reads all emails from the mailbox ordinace@buzalkova.cz directly via the
Microsoft Graph API and imports them as documents into MongoDB.
From each message it extracts all available properties:
- subject, sender, recipients (To/CC/BCC with types)
- time of receipt, sending, creation, modification (UTC)
- HTML body (max 2 MB) + text preview
- attachments (metadata: name, size, MIME type, inline flag)
- internet headers (SPF, DKIM, Received, X-*, ...)
- MAPI equivalents: importance, flag, conversation thread,
categories, In-Reply-To, References, ...
- additionally: isRead, isDraft, folder_path, inferenceClassification
Walks ALL folders of the mailbox recursively (Inbox, Sent, Deleted,
archive folders, ...).
DB: emaily
Collection: ordinace@buzalkova.cz
_id: Internet Message-ID (or "graphid:<id>" as a fallback)
Safe to interrupt and re-run:
- upsert by _id: duplicates are overwritten automatically
- --skip-existing loads the list of finished _id values from MongoDB and skips them
NOTE: The script only READS from the mailbox; it never writes to it.
Usage:
python parse_emails_graph_v1.0.py # full import
python parse_emails_graph_v1.0.py --limit 50 # test on the first 50
python parse_emails_graph_v1.0.py --skip-existing # resume after an interruption
python parse_emails_graph_v1.0.py --folder Inbox # a single folder only
python parse_emails_graph_v1.0.py --no-indexes # skip index creation at the end
Dependencies:
msal, requests, pymongo, python-dateutil
Python 3.10+
Structure of a MongoDB document:
_id                       Internet Message-ID (or the graphid: fallback)
graph_id                  Graph API message ID (for any follow-up operations)
subject                   message subject
normalized_subject        subject without RE:/FW:/AW: prefixes
importance                0=low 1=normal 2=high
flag_status               0=not flagged 1=flagged 2=complete
is_read                   bool; current read state in the mailbox
is_draft                  bool
has_attachments           bool
attachment_count          int
inference_classification  focused / other (Outlook AI sorting)
categories                [str]
conversation_id           Graph conversationId
conversation_index        base64 conversationIndex
conversation_topic        thread topic (from the Thread-Topic internet header)
in_reply_to               Message-ID of the previous message
internet_references       [Message-ID]; the full thread history
received_at               datetime UTC
sent_at                   datetime UTC
created_at                datetime UTC; when the record was created in M365
modified_at               datetime UTC; time of the last modification
folder_id                 Graph parentFolderId
folder_path               full folder path (e.g. Inbox/Subfolder)
sender.email              sender's email address
sender.name               sender's display name
to                        To string (joined)
cc                        CC string
bcc                       BCC string
recipients                [{type, email, name}]; to/cc/bcc with types
body_html                 HTML body (max 2 MB)
body_preview              text preview (max 255 characters, from Graph)
attachments               [{filename, size_bytes, mime_type,
                          content_id, is_inline}]
headers                   dict of internet headers (lowercase_with_underscores)
parsed_at                 datetime UTC; time of parsing
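As a rough illustration of how normalized_subject and importance in the table
above might be derived, the sketch below strips reply/forward prefixes and maps
Graph's importance string to an integer. The names PREFIX, IMPORTANCE and
normalize_subject are hypothetical, the prefix list is shortened, and stripping
repeatedly is an assumption about the intended behavior.

```python
import re

# Shortened, hypothetical variant of the reply/forward prefix pattern.
PREFIX = re.compile("^(RE|FW|FWD|AW)[: ]+", re.IGNORECASE)
# Graph importance string -> integer, matching the 0/1/2 scheme in the table.
IMPORTANCE = {"low": 0, "normal": 1, "high": 2}


def normalize_subject(subject: str) -> str:
    # Strip prefixes until nothing changes, so stacked ones such as "RE: FW:" all go.
    previous = None
    current = subject.strip()
    while current != previous:
        previous = current
        current = PREFIX.sub("", current).strip()
    return current


example = normalize_subject("RE: FW: Vysledky laboratore")
level = IMPORTANCE.get("high", 1)
```

A subject that merely starts with the letters of a prefix (for example
"Report") is left alone, because the pattern requires a colon or space after
the prefix.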
Indexes:
received_at, sent_at, sender.email, graph_id (unique),
conversation_id, folder_path, has_attachments, categories,
importance, flag_status, is_read,
text_search (subject + body_preview + to + cc)
Version history:
1.0 2026-06-02 Initial version; Graph API as the source
"""
import os
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
# The client secret is read from the environment so that it is not committed to the
# repository. The variable name is an assumption; a KeyError is raised if it is unset.
GRAPH_CLIENT_SECRET = os.environ["GRAPH_CLIENT_SECRET"]
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "ordinace@buzalkova.cz"
BATCH_SIZE = 100
PAGE_SIZE = 50
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.0"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
IMPORTANCE_MAP = {"low": 0, "normal": 1, "high": 2}
FLAG_STATUS_MAP = {"notFlagged": 0, "flagged": 1, "complete": 2}
RE_SUBJECT = re.compile(r"^(RE|FW|AW|SV|VS|TR|WG|odpov[eě]d[ťt]|fwd?)[:\s]+", re.IGNORECASE)
MSG_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification,internetMessageHeaders"
)
# ─── Graph API helpers ────────────────────────────────────────────────────────
_graph_token: Optional[str] = None
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get(url: str, params: Optional[dict] = None) -> dict:
    """GET a Graph resource as JSON; on HTTP 401 refresh the token and retry once."""
    global _graph_token
    if not _graph_token:
        get_token()
    for attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retry: {url}")


def get_all_folders(parent_id: Optional[str] = None, parent_path: str = "") -> list[dict]:
    """Recursively load all mailbox folders. Returns [{id, path}]."""
    if parent_id is None:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders"
    else:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{parent_id}/childFolders"
    folders = []
    params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
    while url:
        data = graph_get(url, params)
        for f in data.get("value", []):
            path = f"{parent_path}/{f['displayName']}".lstrip("/")
            folders.append({"id": f["id"], "path": path})
            if f.get("childFolderCount", 0) > 0:
                folders.extend(get_all_folders(f["id"], path))
        # The nextLink already carries the query string, so params are sent only once.
        url = data.get("@odata.nextLink")
        params = None
    return folders


def iter_folder_messages(folder_id: str):
    """Generator: yields the messages of one folder, page by page."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{folder_id}/messages"
    params = {"$top": PAGE_SIZE, "$select": MSG_SELECT, "$expand": "attachments"}
    while url:
        data = graph_get(url, params)
        for msg in data.get("value", []):
            yield msg
        url = data.get("@odata.nextLink")
        params = None
# ─── Helper functions ─────────────────────────────────────────────────────────
def parse_date(raw) -> Optional[datetime]:
if raw is None:
return None
if isinstance(raw, datetime):
if raw.tzinfo:
return raw.astimezone(timezone.utc).replace(tzinfo=None)
return raw
try:
dt = dtparser.parse(str(raw))
if dt.tzinfo:
return dt.astimezone(timezone.utc).replace(tzinfo=None)
return dt
except Exception:
return None
def normalize_subject(subject: str) -> str:
s = subject.strip()
while True:
m = RE_SUBJECT.match(s)
if not m:
break
s = s[m.end():].strip()
return s
def parse_headers(raw_headers: list) -> dict:
result = {}
for h in raw_headers:
k = h["name"].lower().replace("-", "_")
v = h["value"]
if k in result:
existing = result[k]
if isinstance(existing, list):
existing.append(v)
else:
result[k] = [existing, v]
else:
result[k] = v
return result
def format_recipients(lst: list) -> str:
return "; ".join(
f'{r["emailAddress"].get("name", "")} <{r["emailAddress"].get("address", "")}>'.strip()
for r in lst
)
# ─── Main extraction ─────────────────────────────────────────────────────────
def extract_message(msg: dict, folder_path: str) -> Optional[dict]:
try:
# _id
mid = (msg.get("internetMessageId") or "").strip()
if not mid:
mid = f"graphid:{msg['id']}"
subject = msg.get("subject") or ""
norm_subject = normalize_subject(subject)
# body
body_html = None
body_preview = msg.get("bodyPreview") or ""
body = msg.get("body", {})
if body.get("contentType") == "html":
content = body.get("content") or ""
body_html = content if len(content) <= 2 * 1024 * 1024 else content[:2 * 1024 * 1024]
elif body.get("contentType") == "text":
body_preview = (body.get("content") or "")[:2000]
# sender
sender_ea = (msg.get("from") or msg.get("sender") or {}).get("emailAddress", {})
sender_email = sender_ea.get("address", "")
sender_name = sender_ea.get("name", "")
# recipients
to_list = msg.get("toRecipients", [])
cc_list = msg.get("ccRecipients", [])
bcc_list = msg.get("bccRecipients", [])
recipients = (
[{"type": "to", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in to_list] +
[{"type": "cc", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in cc_list] +
[{"type": "bcc", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in bcc_list]
)
# flags
importance = IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1)
flag_status = FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0)
# internet headers
raw_headers = msg.get("internetMessageHeaders") or []
headers = parse_headers(raw_headers)
in_reply_to = headers.get("in_reply_to", "")
if isinstance(in_reply_to, list):
in_reply_to = in_reply_to[0]
refs_raw = headers.get("references", "")
if isinstance(refs_raw, list):
refs_raw = " ".join(refs_raw)
internet_refs = [r.strip() for r in refs_raw.split() if r.strip()] if refs_raw else []
conv_topic = headers.get("thread_topic", "")
if isinstance(conv_topic, list):
conv_topic = conv_topic[0]
# conversation index
conv_index = ""
ci_raw = msg.get("conversationIndex")
if ci_raw:
try:
conv_index = base64.b64encode(base64.b64decode(ci_raw)).decode()
except Exception:
conv_index = ci_raw
# attachments (metadata only, no content)
attachments = []
for att in msg.get("attachments") or []:
fname = att.get("name") or ""
if not fname:
continue
attachments.append({
"filename": fname,
"size_bytes": att.get("size", 0),
"mime_type": att.get("contentType", "application/octet-stream"),
"content_id": att.get("contentId"),
"is_inline": att.get("isInline", False),
})
return {
"_id": mid,
"graph_id": msg["id"],
"subject": subject,
"normalized_subject": norm_subject,
"importance": importance,
"flag_status": flag_status,
"is_read": msg.get("isRead", False),
"is_draft": msg.get("isDraft", False),
"has_attachments": msg.get("hasAttachments", False),
"attachment_count": len(attachments),
"inference_classification": msg.get("inferenceClassification", ""),
"categories": msg.get("categories") or [],
"conversation_id": msg.get("conversationId", ""),
"conversation_index": conv_index,
"conversation_topic": conv_topic,
"in_reply_to": in_reply_to,
"internet_references": internet_refs,
"received_at": parse_date(msg.get("receivedDateTime")),
"sent_at": parse_date(msg.get("sentDateTime")),
"created_at": parse_date(msg.get("createdDateTime")),
"modified_at": parse_date(msg.get("lastModifiedDateTime")),
"folder_id": msg.get("parentFolderId", ""),
"folder_path": folder_path,
"sender": {
"email": sender_email,
"name": sender_name,
},
"to": format_recipients(to_list),
"cc": format_recipients(cc_list),
"bcc": format_recipients(bcc_list),
"recipients": recipients,
"body_html": body_html,
"body_preview": body_preview,
"attachments": attachments,
"headers": headers,
"parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
}
except Exception as e:
logging.error("extract_message failed [%s]: %s", msg.get("id", "?"), e)
return None
# ─── MongoDB indexes ──────────────────────────────────────────────────────────
def create_indexes(col):
print(" Vytvarim indexy...")
col.create_index([("received_at", ASCENDING)])
col.create_index([("sent_at", ASCENDING)])
col.create_index([("sender.email", ASCENDING)])
col.create_index([("graph_id", ASCENDING)], unique=True, sparse=True)
col.create_index([("conversation_id", ASCENDING)])
col.create_index([("folder_path", ASCENDING)])
col.create_index([("has_attachments", ASCENDING)])
col.create_index([("categories", ASCENDING)])
col.create_index([("importance", ASCENDING)])
col.create_index([("flag_status", ASCENDING)])
col.create_index([("is_read", ASCENDING)])
col.create_index([
("subject", TEXT),
("body_preview", TEXT),
("to", TEXT),
("cc", TEXT),
], name="text_search", default_language="none")
print(" Indexy hotovy.")
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"parse_emails_graph v{SCRIPT_VERSION}")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N zprav (0 = vse)")
ap.add_argument("--skip-existing", action="store_true",
help="Preskocit zpravy ktere jiz jsou v MongoDB")
ap.add_argument("--folder", default="",
help="Zpracovat jen slozku se zadanym nazvem (napr. Inbox)")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na konci")
args = ap.parse_args()
start = datetime.now()
print(f"=== parse_emails_graph v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {GRAPH_MAILBOX}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
# Graph token
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
# MongoDB
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col = client[MONGO_DB][MONGO_COL]
# Skip existing
existing: set = set()
if args.skip_existing:
print(" Nacitam existujici zaznamy z MongoDB...")
existing = set(col.distinct("_id"))
print(f" {len(existing)} jiz importovano")
# Folders
print("\nNacitam seznam slozek...")
all_folders = get_all_folders()
if args.folder:
all_folders = [f for f in all_folders if args.folder.lower() in f["path"].lower()]
print(f" Slozek ke zpracovani: {len(all_folders)}")
for f in all_folders:
print(f" {f['path']}")
# Import
batch = []
ok_count = 0
err_count = 0
skip_count = 0
total_i = 0
def flush():
if not batch:
return
try:
col.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
print()
    for folder in all_folders:
        print(f"--- Folder: {folder['path']} ---")
        folder_count = 0
        for msg in iter_folder_messages(folder["id"]):
            if args.limit and total_i >= args.limit:
                break
            mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
            if mid in existing:
                skip_count += 1
                total_i += 1
                continue
            doc = extract_message(msg, folder["path"])
            total_i += 1
            folder_count += 1
            if doc is None:
                err_count += 1
            else:
                batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
                ok_count += 1
                if len(batch) >= BATCH_SIZE:
                    flush()
            status = "ERR " if doc is None else "OK "
            subject_str = (doc.get("subject") or "")[:60] if doc else "?"
            sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
            print(f" {total_i:>6} {status} {subject_str:<60} {sender_str}")
            if total_i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = total_i / elapsed if elapsed > 0 else 0
                print(f" {'-'*80}")
                print(f" Progress: ok={ok_count} skip={skip_count} err={err_count} {rate:.1f} msg/s")
                print(f" {'-'*80}")
        # Write whatever is left in the batch before moving to the next folder.
        flush()
        print(f"{folder_count} messages from folder {folder['path']}")
        if args.limit and total_i >= args.limit:
            break
elapsed_total = (datetime.now() - start).total_seconds()
print(f"\n{'='*52}")
print(f"Vysledek: ok={ok_count} | skip={skip_count} | err={err_count}")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"Dokumentu v kolekci: {col.count_documents({})}")
if not args.no_indexes:
print()
create_indexes(col)
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()
@@ -0,0 +1,605 @@
"""
parse_emails_graph_v1.1.py

Name:    parse_emails_graph_v1.1.py
Version: 1.1
Date:    2026-06-02
Author:  vladimir.buzalka

Description:
    Reads all emails from the mailbox ordinace@buzalkova.cz directly through
    the Microsoft Graph API and imports them as documents into MongoDB.

    For each message it extracts all available properties:
      - subject, sender, recipients (To/CC/BCC with types)
      - received, sent, created, and modified times (UTC)
      - HTML body (max 2 MB) + text preview
      - attachments (metadata: name, size, MIME type, inline flag)
      - internet headers (SPF, DKIM, Received, X-*, ...)
      - MAPI equivalents: importance, flag, conversation thread,
        categories, In-Reply-To, References, ...
      - additionally: isRead, isDraft, folder_path, inferenceClassification

    Walks ALL mailbox folders recursively (Inbox, Sent, Deleted,
    archive folders, ...).

    DB:         emaily
    Collection: ordinace@buzalkova.cz
    _id:        Internet Message-ID (or "graphid:<id>" as a fallback)

    NOTE: The script only READS from the mailbox; it never writes to it!

Usage:
    # First import (everything):
    python parse_emails_graph_v1.1.py
    # Test on the first 50 messages:
    python parse_emails_graph_v1.1.py --limit 50 --no-indexes
    # Only one folder (matches every folder whose path contains "Inbox"):
    python parse_emails_graph_v1.1.py --folder Inbox
    # Resume after an interruption (new messages only):
    python parse_emails_graph_v1.1.py --mode new-only
    # Regular sync (updates is_read, flag, folder; imports new messages):
    python parse_emails_graph_v1.1.py --mode sync
    # Full re-import of all data:
    python parse_emails_graph_v1.1.py --mode full

Modes (--mode):
    full      Full upsert of all fields for every message (default)
    new-only  Skips messages that are already in MongoDB, imports only new ones
    sync      Existing messages: updates only is_read/is_draft/flag_status/
              importance/categories/modified_at/folder_id/folder_path
              (and parsed_at). New messages are imported in full.
              Ideal for regular runs.
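
The three modes boil down to one decision per message. The following is only
an illustrative sketch of that decision: plan_action and its return labels are
hypothetical names that do not exist in this script, which makes the same
choice inline in main().

```python
def plan_action(mode: str, message_id: str, existing: set[str]) -> str:
    # Hypothetical helper: which action a mode takes for a single message.
    known = message_id in existing
    if mode == "new-only" and known:
        return "skip"           # already imported, leave it untouched
    if mode == "sync" and known:
        return "update-flags"   # refresh only the mutable fields
    return "full-upsert"        # full mode, or a message not seen before


existing_ids = {"<a@example.com>"}
for mode in ("full", "new-only", "sync"):
    print(mode,
          plan_action(mode, "<a@example.com>", existing_ids),
          plan_action(mode, "<b@example.com>", existing_ids))
```

In full mode the script never loads the set of existing ids, so every message
takes the full-upsert path.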

Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+

MongoDB document structure:
    _id                       Internet Message-ID (or graphid: fallback)
    graph_id                  Graph API message ID
    subject                   message subject
    normalized_subject        subject without RE:/FW:/AW: prefixes
    importance                0=low  1=normal  2=high
    flag_status               0=not flagged  1=flagged  2=complete
    is_read                   bool, current read state in the mailbox
    is_draft                  bool
    has_attachments           bool
    attachment_count          int
    inference_classification  focused / other
    categories                [str]
    conversation_id           Graph conversationId
    conversation_index        base64 conversationIndex
    conversation_topic        thread topic (from the Thread-Topic internet header)
    in_reply_to               Message-ID of the previous message
    internet_references       [Message-ID]
    received_at               datetime UTC
    sent_at                   datetime UTC
    created_at                datetime UTC
    modified_at               datetime UTC
    folder_id                 Graph parentFolderId
    folder_path               full folder path (e.g. Inbox/Subfolder)
    sender.email              sender's email address
    sender.name               display name
    to                        To string (joined)
    cc                        CC string
    bcc                       BCC string
    recipients                [{type, email, name}]
    body_html                 HTML body (max 2 MB)
    body_preview              text preview (Graph bodyPreview, max 255 characters;
                              for plain-text bodies the first 2000 characters)
    attachments               [{filename, size_bytes, mime_type, content_id, is_inline}]
    headers                   dict of internet headers
    parsed_at                 datetime UTC

Indexes:
    received_at, sent_at, sender.email, graph_id (unique),
    conversation_id, folder_path, has_attachments, categories,
    importance, flag_status, is_read,
    text_search (subject + body_preview + to + cc)
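
With these fields and indexes, typical lookups are plain MongoDB filter
documents. A minimal sketch, not part of this script: unread_from and
text_query are hypothetical helper names, and `col` in the last comment is
assumed to be a pymongo collection handle for this collection.

```python
from datetime import datetime


def unread_from(sender_email: str, since: datetime) -> dict:
    # Filter for unread messages from one sender received on or after `since`.
    # received_at is stored as naive UTC, so `since` should be naive UTC too.
    return {
        "sender.email": sender_email,
        "is_read": False,
        "received_at": {"$gte": since},
    }


def text_query(words: str) -> dict:
    # Served by the text_search index (subject + body_preview + to + cc).
    return {"$text": {"$search": words}}


flt = unread_from("someone@example.com", datetime(2026, 1, 1))
print(flt)
print(text_query("invoice"))
# With a live collection: col.find(flt).sort("received_at", -1)
```

Both helpers only build dictionaries, so they can be tried without a database.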

Version history:
    1.0  2026-06-02  Initial version
    1.1  2026-06-02  Added the modes --mode full/new-only/sync;
                     removed --skip-existing (replaced by --mode new-only)
"""
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "ordinace@buzalkova.cz"
BATCH_SIZE = 100
PAGE_SIZE = 50
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.1"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
IMPORTANCE_MAP = {"low": 0, "normal": 1, "high": 2}
FLAG_STATUS_MAP = {"notFlagged": 0, "flagged": 1, "complete": 2}
RE_SUBJECT = re.compile(r"^(RE|FW|AW|SV|VS|TR|WG|odpov[eě]d[ťt]|fwd?)[:\s]+", re.IGNORECASE)
MSG_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification,internetMessageHeaders"
)
# Sync mode only needs the mutable fields, which makes the fetch faster
MSG_SELECT_SYNC = (
"id,internetMessageId,isRead,isDraft,flag,categories,"
"lastModifiedDateTime,parentFolderId,importance"
)
# ─── Graph API helpers ────────────────────────────────────────────────────────
_graph_token: Optional[str] = None
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get(url: str, params: Optional[dict] = None) -> dict:
    """GET a Graph resource as JSON; on HTTP 401 refresh the token and retry once."""
    global _graph_token
    if not _graph_token:
        get_token()
    for attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retry: {url}")


def get_all_folders(parent_id: Optional[str] = None, parent_path: str = "") -> list[dict]:
    """Recursively load all mailbox folders. Returns [{id, path}]."""
    if parent_id is None:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders"
    else:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{parent_id}/childFolders"
    folders = []
    params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
    while url:
        data = graph_get(url, params)
        for f in data.get("value", []):
            path = f"{parent_path}/{f['displayName']}".lstrip("/")
            folders.append({"id": f["id"], "path": path})
            if f.get("childFolderCount", 0) > 0:
                folders.extend(get_all_folders(f["id"], path))
        # The nextLink already carries the query string, so params are sent only once.
        url = data.get("@odata.nextLink")
        params = None
    return folders


def iter_folder_messages(folder_id: str, select: str = MSG_SELECT, expand_attachments: bool = True):
    """Generator: yields the messages of one folder, page by page."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{folder_id}/messages"
    params = {"$top": PAGE_SIZE, "$select": select}
    if expand_attachments:
        params["$expand"] = "attachments"
    while url:
        data = graph_get(url, params)
        for msg in data.get("value", []):
            yield msg
        url = data.get("@odata.nextLink")
        params = None
# ─── Helper functions ─────────────────────────────────────────────────────────
def parse_date(raw) -> Optional[datetime]:
    """Parse a timestamp into a naive UTC datetime; return None if it cannot be parsed."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


def normalize_subject(subject: str) -> str:
    """Strip leading reply/forward prefixes (RE:, FW:, AW:, ...) repeatedly."""
    s = subject.strip()
    while True:
        m = RE_SUBJECT.match(s)
        if not m:
            break
        s = s[m.end():].strip()
    return s


def parse_headers(raw_headers: list) -> dict:
    """Turn Graph's [{name, value}] header list into a dict; repeated headers become lists."""
    result = {}
    for h in raw_headers:
        k = h["name"].lower().replace("-", "_")
        v = h["value"]
        if k in result:
            existing = result[k]
            result[k] = existing + [v] if isinstance(existing, list) else [existing, v]
        else:
            result[k] = v
    return result


def format_recipients(lst: list) -> str:
    """Join recipients into one 'Name <address>; ...' string."""
    return "; ".join(
        f'{r["emailAddress"].get("name", "")} <{r["emailAddress"].get("address", "")}>'.strip()
        for r in lst
    )
# ─── Message extraction ───────────────────────────────────────────────────────
def extract_message(msg: dict, folder_path: str) -> Optional[dict]:
    """Full extraction; used for full mode and for new messages in sync/new-only."""
    try:
        mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
        subject = msg.get("subject") or ""

        # Body: keep HTML (capped at 2 * 1024 * 1024 characters); for a plain-text
        # body, store its first 2000 characters as the preview instead.
        body_html = None
        body_preview = msg.get("bodyPreview") or ""
        body = msg.get("body") or {}
        if body.get("contentType") == "html":
            content = body.get("content") or ""
            body_html = content[:2 * 1024 * 1024]
        elif body.get("contentType") == "text":
            body_preview = (body.get("content") or "")[:2000]

        # Sender and recipients
        sender_ea = (msg.get("from") or msg.get("sender") or {}).get("emailAddress") or {}
        to_list = msg.get("toRecipients") or []
        cc_list = msg.get("ccRecipients") or []
        bcc_list = msg.get("bccRecipients") or []
        recipients = (
            [{"type": "to", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in to_list] +
            [{"type": "cc", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in cc_list] +
            [{"type": "bcc", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in bcc_list]
        )

        # Flags
        importance = IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1)
        flag_status = FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0)

        # Internet headers; a repeated header arrives as a list
        raw_headers = msg.get("internetMessageHeaders") or []
        headers = parse_headers(raw_headers)
        in_reply_to = headers.get("in_reply_to", "")
        if isinstance(in_reply_to, list):
            in_reply_to = in_reply_to[0]
        refs_raw = headers.get("references", "")
        if isinstance(refs_raw, list):
            refs_raw = " ".join(refs_raw)
        internet_refs = [r.strip() for r in refs_raw.split() if r.strip()] if refs_raw else []
        conv_topic = headers.get("thread_topic", "")
        if isinstance(conv_topic, list):
            conv_topic = conv_topic[0]

        # Conversation index: normalize the base64 text, fall back to the raw value
        conv_index = ""
        ci_raw = msg.get("conversationIndex")
        if ci_raw:
            try:
                conv_index = base64.b64encode(base64.b64decode(ci_raw)).decode()
            except Exception:
                conv_index = ci_raw

        # Attachments (metadata only, no content)
        attachments = []
        for att in msg.get("attachments") or []:
            fname = att.get("name") or ""
            if not fname:
                continue
            attachments.append({
                "filename": fname,
                "size_bytes": att.get("size", 0),
                "mime_type": att.get("contentType", "application/octet-stream"),
                "content_id": att.get("contentId"),
                "is_inline": att.get("isInline", False),
            })

        return {
            "_id": mid,
            "graph_id": msg["id"],
            "subject": subject,
            "normalized_subject": normalize_subject(subject),
            "importance": importance,
            "flag_status": flag_status,
            "is_read": msg.get("isRead", False),
            "is_draft": msg.get("isDraft", False),
            "has_attachments": msg.get("hasAttachments", False),
            "attachment_count": len(attachments),
            "inference_classification": msg.get("inferenceClassification", ""),
            "categories": msg.get("categories") or [],
            "conversation_id": msg.get("conversationId", ""),
            "conversation_index": conv_index,
            "conversation_topic": conv_topic,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "received_at": parse_date(msg.get("receivedDateTime")),
            "sent_at": parse_date(msg.get("sentDateTime")),
            "created_at": parse_date(msg.get("createdDateTime")),
            "modified_at": parse_date(msg.get("lastModifiedDateTime")),
            "folder_id": msg.get("parentFolderId", ""),
            "folder_path": folder_path,
            "sender": {
                "email": sender_ea.get("address", ""),
                "name": sender_ea.get("name", ""),
            },
            "to": format_recipients(to_list),
            "cc": format_recipients(cc_list),
            "bcc": format_recipients(bcc_list),
            "recipients": recipients,
            "body_html": body_html,
            "body_preview": body_preview,
            "attachments": attachments,
            "headers": headers,
            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }
    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg.get("id", "?"), e)
        return None
def extract_sync_fields(msg: dict, folder_path: str) -> dict:
    """Mutable fields only; used in sync mode for messages that already exist."""
    return {
        # Graph message ids are not immutable by default: they change when a
        # message is moved to another folder, so graph_id is refreshed as well.
        "graph_id": msg["id"],
        "is_read": msg.get("isRead", False),
        "is_draft": msg.get("isDraft", False),
        "flag_status": FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0),
        "importance": IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1),
        "categories": msg.get("categories") or [],
        "modified_at": parse_date(msg.get("lastModifiedDateTime")),
        "folder_id": msg.get("parentFolderId", ""),
        "folder_path": folder_path,
        "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
    }
# ─── MongoDB indexes ──────────────────────────────────────────────────────────
def create_indexes(col):
    """Create the lookup indexes and the combined text index (safe to call repeatedly)."""
    print(" Creating indexes...")
    col.create_index([("received_at", ASCENDING)])
    col.create_index([("sent_at", ASCENDING)])
    col.create_index([("sender.email", ASCENDING)])
    col.create_index([("graph_id", ASCENDING)], unique=True, sparse=True)
    col.create_index([("conversation_id", ASCENDING)])
    col.create_index([("folder_path", ASCENDING)])
    col.create_index([("has_attachments", ASCENDING)])
    col.create_index([("categories", ASCENDING)])
    col.create_index([("importance", ASCENDING)])
    col.create_index([("flag_status", ASCENDING)])
    col.create_index([("is_read", ASCENDING)])
    col.create_index([
        ("subject", TEXT),
        ("body_preview", TEXT),
        ("to", TEXT),
        ("cc", TEXT),
    ], name="text_search", default_language="none")
    print(" Indexes done.")
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description=f"parse_emails_graph v{SCRIPT_VERSION}")
    ap.add_argument("--mode", default="full", choices=["full", "new-only", "sync"],
                    help="full=full upsert (default) | new-only=only new messages | "
                         "sync=update only the mutable fields of existing messages, import new ones in full")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N messages (0 = all)")
    ap.add_argument("--folder", default="",
                    help="Process only folders whose path contains the given text, "
                         "case-insensitive (e.g. Inbox)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes at the end")
    args = ap.parse_args()

    start = datetime.now()
    print(f"=== parse_emails_graph v{SCRIPT_VERSION} ===")
    print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Mailbox: {GRAPH_MAILBOX}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
    print(f"Mode: {args.mode}")

    print("\nConnecting to Graph API...")
    try:
        get_token()
        print(" Graph API OK")
    except Exception as e:
        print(f" ERROR: {e}")
        sys.exit(1)

    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print(" MongoDB OK")
    except Exception as e:
        print(f" ERROR: MongoDB is not available -- {e}")
        sys.exit(1)
    col = client[MONGO_DB][MONGO_COL]

    # Existing _id values (needed for new-only and sync)
    existing: set = set()
    if args.mode in ("new-only", "sync"):
        print(" Loading existing records from MongoDB...")
        # Stream only the _id field: distinct() returns its result as a single
        # document and therefore fails once the id list exceeds 16 MB.
        existing = {d["_id"] for d in col.find({}, {"_id": 1})}
        print(f" {len(existing)} already imported")

    print("\nLoading folder list...")
    all_folders = get_all_folders()
    if args.folder:
        all_folders = [f for f in all_folders if args.folder.lower() in f["path"].lower()]
    print(f" Folders to process: {len(all_folders)}")
    for f in all_folders:
        print(f" {f['path']}")

    # In sync mode only the mutable fields are fetched
    is_sync = args.mode == "sync"
    msg_select = MSG_SELECT_SYNC if is_sync else MSG_SELECT
    expand_att = not is_sync

    batch = []
    ok_count = 0
    sync_count = 0
    err_count = 0
    skip_count = 0
    total_i = 0

    def flush():
        """Write the pending batch; errors are logged and the batch is dropped."""
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            logging.error("bulk_write: %s", e)
            print(f" ERROR bulk_write: {e}")
        batch.clear()

    print()
    for folder in all_folders:
        print(f"--- Folder: {folder['path']} ---")
        folder_count = 0
        for msg in iter_folder_messages(folder["id"], select=msg_select, expand_attachments=expand_att):
            if args.limit and total_i >= args.limit:
                break
            mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
            total_i += 1
            folder_count += 1
            if args.mode == "new-only" and mid in existing:
                skip_count += 1
                continue
            if is_sync and mid in existing:
                # Sync of an existing message: mutable fields only
                fields = extract_sync_fields(msg, folder["path"])
                batch.append(UpdateOne({"_id": mid}, {"$set": fields}))
                sync_count += 1
                status = "SYN "
                print(f" {total_i:>6} {status} {mid[:80]}")
            else:
                # Full extraction (new messages in new-only and sync, everything in full).
                # In sync mode the listing holds only the mutable fields, so a new
                # message has to be fetched again with the full field set.
                if is_sync:
                    full_url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{msg['id']}"
                    full_params = {"$select": MSG_SELECT, "$expand": "attachments"}
                    try:
                        msg = graph_get(full_url, full_params)
                    except Exception as e:
                        logging.error("full fetch failed [%s]: %s", msg.get("id", "?"), e)
                        err_count += 1
                        continue
                doc = extract_message(msg, folder["path"])
                if doc is None:
                    err_count += 1
                    status = "ERR "
                    print(f" {total_i:>6} {status} {mid[:80]}")
                else:
                    batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
                    ok_count += 1
                    status = "OK "
                    subject_str = (doc.get("subject") or "")[:60]
                    sender_str = (doc.get("sender", {}).get("email") or "")[:40]
                    print(f" {total_i:>6} {status} {subject_str:<60} {sender_str}")
            if len(batch) >= BATCH_SIZE:
                flush()
            if total_i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = total_i / elapsed if elapsed > 0 else 0
                print(f" {'-'*80}")
                print(f" Progress: ok={ok_count} sync={sync_count} skip={skip_count} err={err_count} {rate:.1f} msg/s")
                print(f" {'-'*80}")
        # Write whatever is left in the batch before moving to the next folder.
        flush()
        print(f"{folder_count} messages from folder {folder['path']}")
        if args.limit and total_i >= args.limit:
            break

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Result: ok={ok_count} | sync={sync_count} | skip={skip_count} | err={err_count}")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Documents in collection: {col.count_documents({})}")
    if not args.no_indexes:
        print()
        create_indexes(col)
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
@@ -0,0 +1,610 @@
"""
parse_emails_graph_v1.2.py
Nazev: parse_emails_graph_v1.2.py
Verze: 1.2
Datum: 2026-06-02
Autor: vladimir.buzalka
Popis:
Cte vsechny emaily ze schranky ordinace@buzalkova.cz primo pres
Microsoft Graph API a importuje je jako dokumenty do MongoDB.
Ze kazde zpravy extrahuje vsechny dostupne vlastnosti:
- predmet, odesilatel, prijemci (To/CC/BCC s typy)
- cas doruceni, odeslani, vytvoreni, modifikace (UTC)
- telo HTML (max 2 MB) + textovy preview
- prilohy (metadata: jmeno, velikost, MIME typ, inline flag, graph_att_id)
- internet headers (SPF, DKIM, Received, X-*, ...)
- MAPI-ekvivalenty: dulezitost, priznak, konverzacni vlakno,
kategorie, In-Reply-To, References, ...
- navic: isRead, isDraft, folder_path, inferenceClassification
Prochazi VSECHNY slozky schranky rekurzivne (Inbox, Sent, Deleted,
archivni slozky, ...).
DB: emaily
Kolekce: ordinace@buzalkova.cz
_id: Internet Message-ID (nebo "graphid:<id>" jako fallback)
POZOR: Skript pouze CIST ze schranky — zadny zapis do schranky!
Spousteni:
# Prvni import (vsechno):
python parse_emails_graph_v1.2.py
# Test na prvnich 50:
python parse_emails_graph_v1.2.py --limit 50 --no-indexes
# Jen jedna slozka:
python parse_emails_graph_v1.2.py --folder Inbox
# Pokracovani po preruseni (pouze nove):
python parse_emails_graph_v1.2.py --mode new-only
# Pravidelny sync (aktualizuje is_read, flag, slozku; importuje nove):
python parse_emails_graph_v1.2.py --mode sync
# Plny reimport vsech dat:
python parse_emails_graph_v1.2.py --mode full
Rezimy (--mode):
full Plny upsert vsech poli pro kazdou zpravu (vychozi)
new-only Preskoci zpravy ktere uz jsou v MongoDB, importuje jen nove
sync Existujici: aktualizuje jen is_read/flag_status/categories/
modified_at/folder_path. Nove zpravy importuje cely.
Idealni pro pravidelne spousteni.
Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+
MongoDB document structure:
    _id                       Internet Message-ID (or graphid: fallback)
    graph_id                  Graph API message ID
    subject                   message subject
    normalized_subject        subject without RE:/FW:/AW: prefixes
    importance                0=low 1=normal 2=high
    flag_status               0=not flagged 1=flagged 2=complete
    is_read                   bool, current read state in the mailbox
    is_draft                  bool
    has_attachments           bool
    attachment_count          int
    inference_classification  focused / other
    categories                [str]
    conversation_id           Graph conversationId
    conversation_index        base64 conversationIndex
    conversation_topic        thread topic (from the Thread-Topic internet header)
    in_reply_to               Message-ID of the previous message
    internet_references       [Message-ID]
    received_at               datetime UTC
    sent_at                   datetime UTC
    created_at                datetime UTC
    modified_at               datetime UTC
    folder_id                 Graph parentFolderId
    folder_path               full folder path (e.g. Inbox/Subfolder)
    sender.email              sender's email address
    sender.name               display name
    to                        To recipients joined into one string
    cc                        CC recipients joined into one string
    bcc                       BCC recipients joined into one string
    recipients                [{type, email, name}]
    body_html                 HTML body (truncated to 2 * 1024 * 1024 characters);
                              null for plain-text messages
    body_preview              text preview (max 255 characters; for plain-text
                              messages, the first 2000 characters of the body)
    attachments               [{filename, size_bytes, mime_type, is_inline, graph_att_id}]
    headers                   dict of internet headers
    parsed_at                 datetime UTC
Indexes:
    received_at, sent_at, sender.email, graph_id (unique, sparse),
    conversation_id, folder_path, has_attachments, categories,
    importance, flag_status, is_read,
    text_search (subject + body_preview + to + cc)
Version history:
    1.0 2026-06-02 Initial version
    1.1 2026-06-02 Added --mode full/new-only/sync;
                   removed --skip-existing (replaced by --mode new-only)
    1.2 2026-06-02 $expand attachments with $select (no contentBytes, so faster);
                   attachments store graph_att_id for direct download without name matching
"""
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "ordinace@buzalkova.cz"
BATCH_SIZE = 100
PAGE_SIZE = 50
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.2"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
IMPORTANCE_MAP = {"low": 0, "normal": 1, "high": 2}
FLAG_STATUS_MAP = {"notFlagged": 0, "flagged": 1, "complete": 2}
RE_SUBJECT = re.compile(r"^(RE|FW|AW|SV|VS|TR|WG|odpov[eě]d[ťt]|fwd?)[:\s]+", re.IGNORECASE)
# $expand attachments without contentBytes: only the metadata we need
ATT_EXPAND = "attachments($select=id,name,contentType,size,isInline)"
MSG_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification,internetMessageHeaders"
)
# Sync mode only needs the mutable fields, which makes the fetch faster
MSG_SELECT_SYNC = (
"id,internetMessageId,isRead,isDraft,flag,categories,"
"lastModifiedDateTime,parentFolderId,importance"
)
# ─── Graph API helpers ────────────────────────────────────────────────────────
_graph_token: Optional[str] = None


def get_token() -> str:
    """Acquire an app-only Graph token (client-credentials flow) and cache it."""
    global _graph_token
    app = msal.ConfidentialClientApplication(
        GRAPH_CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
        client_credential=GRAPH_CLIENT_SECRET,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(f"Graph auth failed: {result}")
    _graph_token = result["access_token"]
    return _graph_token


def graph_get(url: str, params: Optional[dict] = None) -> dict:
    """GET a Graph URL and return the JSON body; on 401, refresh the token once and retry."""
    global _graph_token
    if not _graph_token:
        get_token()
    for _attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retry: {url}")
def get_all_folders(parent_id: Optional[str] = None, parent_path: str = "") -> list[dict]:
    """Recursively load every folder in the mailbox. Returns [{id, path}]."""
    if parent_id is None:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders"
    else:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{parent_id}/childFolders"
    folders = []
    params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
    while url:
        data = graph_get(url, params)
        for f in data.get("value", []):
            path = f"{parent_path}/{f['displayName']}".lstrip("/")
            folders.append({"id": f["id"], "path": path})
            if f.get("childFolderCount", 0) > 0:
                folders.extend(get_all_folders(f["id"], path))
        url = data.get("@odata.nextLink")
        params = None  # nextLink already carries the query string
    return folders
def iter_folder_messages(folder_id: str, select: str = MSG_SELECT, expand_attachments: bool = True):
    """Generator: yields the messages of one folder, page by page."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{folder_id}/messages"
    params = {"$top": PAGE_SIZE, "$select": select}
    if expand_attachments:
        params["$expand"] = ATT_EXPAND
    while url:
        data = graph_get(url, params)
        for msg in data.get("value", []):
            yield msg
        url = data.get("@odata.nextLink")
        params = None  # nextLink already carries the query string
# ─── Helper functions ─────────────────────────────────────────────────────────
def parse_date(raw) -> Optional[datetime]:
    """Convert a Graph timestamp to a naive UTC datetime; None if it cannot be parsed."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


def normalize_subject(subject: str) -> str:
    """Strip any number of leading reply/forward prefixes (RE:, FW:, AW:, ...)."""
    s = subject.strip()
    while True:
        m = RE_SUBJECT.match(s)
        if not m:
            break
        s = s[m.end():].strip()
    return s


def parse_headers(raw_headers: list) -> dict:
    """Turn Graph's [{name, value}] list into a dict; repeated headers become lists."""
    result = {}
    for h in raw_headers:
        k = h["name"].lower().replace("-", "_")
        v = h["value"]
        if k in result:
            existing = result[k]
            result[k] = existing + [v] if isinstance(existing, list) else [existing, v]
        else:
            result[k] = v
    return result


def format_recipients(lst: list) -> str:
    """Join recipients into a single 'Name <address>; ...' string."""
    return "; ".join(
        f'{r["emailAddress"].get("name", "")} <{r["emailAddress"].get("address", "")}>'.strip()
        for r in lst
    )
# ─── Message extraction ──────────────────────────────────────────────────────
def extract_message(msg: dict, folder_path: str) -> Optional[dict]:
    """Full extraction: used for mode full and for new messages in sync/new-only."""
    try:
        mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
        subject = msg.get("subject") or ""
        body_html = None
        body_preview = msg.get("bodyPreview") or ""
        body = msg.get("body") or {}  # tolerate an explicit null body
        if body.get("contentType") == "html":
            content = body.get("content") or ""
            body_html = content if len(content) <= 2 * 1024 * 1024 else content[:2 * 1024 * 1024]
        elif body.get("contentType") == "text":
            body_preview = (body.get("content") or "")[:2000]
        sender_ea = (msg.get("from") or msg.get("sender") or {}).get("emailAddress", {})
        to_list = msg.get("toRecipients", [])
        cc_list = msg.get("ccRecipients", [])
        bcc_list = msg.get("bccRecipients", [])
        recipients = (
            [{"type": "to", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in to_list] +
            [{"type": "cc", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in cc_list] +
            [{"type": "bcc", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in bcc_list]
        )
        importance = IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1)
        flag_status = FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0)
        raw_headers = msg.get("internetMessageHeaders") or []
        headers = parse_headers(raw_headers)
        in_reply_to = headers.get("in_reply_to", "")
        if isinstance(in_reply_to, list):
            in_reply_to = in_reply_to[0]
        refs_raw = headers.get("references", "")
        if isinstance(refs_raw, list):
            refs_raw = " ".join(refs_raw)
        internet_refs = [r.strip() for r in refs_raw.split() if r.strip()] if refs_raw else []
        conv_topic = headers.get("thread_topic", "")
        if isinstance(conv_topic, list):
            conv_topic = conv_topic[0]
        conv_index = ""
        ci_raw = msg.get("conversationIndex")
        if ci_raw:
            try:
                # Round-trip through base64 to normalize the encoding
                conv_index = base64.b64encode(base64.b64decode(ci_raw)).decode()
            except Exception:
                conv_index = ci_raw
        attachments = []
        for att in msg.get("attachments") or []:
            fname = att.get("name") or ""
            if not fname:
                continue  # unnamed attachments are not recorded
            attachments.append({
                "filename": fname,
                "size_bytes": att.get("size", 0),
                "mime_type": att.get("contentType", "application/octet-stream"),
                "is_inline": att.get("isInline", False),
                "graph_att_id": att.get("id"),
            })
        return {
            "_id": mid,
            "graph_id": msg["id"],
            "subject": subject,
            "normalized_subject": normalize_subject(subject),
            "importance": importance,
            "flag_status": flag_status,
            "is_read": msg.get("isRead", False),
            "is_draft": msg.get("isDraft", False),
            "has_attachments": msg.get("hasAttachments", False),
            "attachment_count": len(attachments),
            "inference_classification": msg.get("inferenceClassification", ""),
            "categories": msg.get("categories") or [],
            "conversation_id": msg.get("conversationId", ""),
            "conversation_index": conv_index,
            "conversation_topic": conv_topic,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "received_at": parse_date(msg.get("receivedDateTime")),
            "sent_at": parse_date(msg.get("sentDateTime")),
            "created_at": parse_date(msg.get("createdDateTime")),
            "modified_at": parse_date(msg.get("lastModifiedDateTime")),
            "folder_id": msg.get("parentFolderId", ""),
            "folder_path": folder_path,
            "sender": {
                "email": sender_ea.get("address", ""),
                "name": sender_ea.get("name", ""),
            },
            "to": format_recipients(to_list),
            "cc": format_recipients(cc_list),
            "bcc": format_recipients(bcc_list),
            "recipients": recipients,
            "body_html": body_html,
            "body_preview": body_preview,
            "attachments": attachments,
            "headers": headers,
            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }
    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg.get("id", "?"), e)
        return None
def extract_sync_fields(msg: dict, folder_path: str) -> dict:
    """Mutable fields only: used in sync mode for messages that already exist."""
    return {
        "is_read": msg.get("isRead", False),
        "is_draft": msg.get("isDraft", False),
        "flag_status": FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0),
        "importance": IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1),
        "categories": msg.get("categories") or [],
        "modified_at": parse_date(msg.get("lastModifiedDateTime")),
        "folder_id": msg.get("parentFolderId", ""),
        "folder_path": folder_path,
        "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
    }
# ─── MongoDB indexes ──────────────────────────────────────────────────────────
def create_indexes(col):
    print("  Creating indexes...")
    col.create_index([("received_at", ASCENDING)])
    col.create_index([("sent_at", ASCENDING)])
    col.create_index([("sender.email", ASCENDING)])
    col.create_index([("graph_id", ASCENDING)], unique=True, sparse=True)
    col.create_index([("conversation_id", ASCENDING)])
    col.create_index([("folder_path", ASCENDING)])
    col.create_index([("has_attachments", ASCENDING)])
    col.create_index([("categories", ASCENDING)])
    col.create_index([("importance", ASCENDING)])
    col.create_index([("flag_status", ASCENDING)])
    col.create_index([("is_read", ASCENDING)])
    # One text index over the searchable fields; "none" disables stemming and stop words
    col.create_index([
        ("subject", TEXT),
        ("body_preview", TEXT),
        ("to", TEXT),
        ("cc", TEXT),
    ], name="text_search", default_language="none")
    print("  Indexes ready.")
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description=f"parse_emails_graph v{SCRIPT_VERSION}")
    ap.add_argument("--mode", default="full", choices=["full", "new-only", "sync"],
                    help="full=full upsert (default) | new-only=new messages only | "
                         "sync=existing messages get only mutable fields updated, new ones are imported in full")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N messages (0 = all)")
    ap.add_argument("--folder", default="",
                    help="Process only folders whose path contains this name (e.g. Inbox)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes at the end")
    args = ap.parse_args()
    start = datetime.now()
    print(f"=== parse_emails_graph v{SCRIPT_VERSION} ===")
    print(f"Start:   {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Mailbox: {GRAPH_MAILBOX}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
    print(f"Mode:    {args.mode}")
    print("\nConnecting to Graph API...")
    try:
        get_token()
        print("  Graph API OK")
    except Exception as e:
        print(f"  ERROR: {e}")
        sys.exit(1)
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print("  MongoDB OK")
    except Exception as e:
        print(f"  ERROR: MongoDB is not reachable -- {e}")
        sys.exit(1)
    col = client[MONGO_DB][MONGO_COL]
    # Existing _id values (needed for new-only and sync)
    existing: set = set()
    if args.mode in ("new-only", "sync"):
        print("  Loading existing records from MongoDB...")
        existing = set(col.distinct("_id"))
        print(f"  {len(existing)} already imported")
    print("\nLoading folder list...")
    all_folders = get_all_folders()
    if args.folder:
        # Case-insensitive substring match on the full folder path
        all_folders = [f for f in all_folders if args.folder.lower() in f["path"].lower()]
    print(f"  Folders to process: {len(all_folders)}")
    for f in all_folders:
        print(f"    {f['path']}")
    # In sync mode we fetch only the mutable fields
    is_sync = args.mode == "sync"
    msg_select = MSG_SELECT_SYNC if is_sync else MSG_SELECT
    expand_att = not is_sync
    batch = []
    ok_count = 0
    sync_count = 0
    err_count = 0
    skip_count = 0
    total_i = 0

    def flush():
        """Write the pending operations to MongoDB and empty the batch."""
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            logging.error("bulk_write: %s", e)
            print(f"  ERROR bulk_write: {e}")
        batch.clear()

    print()
    for folder in all_folders:
        print(f"--- Folder: {folder['path']} ---")
        folder_count = 0
        for msg in iter_folder_messages(folder["id"], select=msg_select, expand_attachments=expand_att):
            if args.limit and total_i >= args.limit:
                break
            mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
            total_i += 1
            folder_count += 1
            if args.mode == "new-only" and mid in existing:
                skip_count += 1
                continue
            if is_sync and mid in existing:
                # Sync of an existing message: mutable fields only
                fields = extract_sync_fields(msg, folder["path"])
                batch.append(UpdateOne({"_id": mid}, {"$set": fields}))
                sync_count += 1
                status = "SYN "
                print(f"  {total_i:>6} {status} {mid[:80]}")
            else:
                # Full extract (new-only: new, sync: new, full: everything).
                # A new message found in sync mode needs a full fetch first.
                if is_sync:
                    full_url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{msg['id']}"
                    full_params = {"$select": MSG_SELECT, "$expand": ATT_EXPAND}
                    try:
                        msg = graph_get(full_url, full_params)
                    except Exception as e:
                        logging.error("full fetch failed [%s]: %s", msg.get("id", "?"), e)
                        err_count += 1
                        continue
                doc = extract_message(msg, folder["path"])
                if doc is None:
                    err_count += 1
                    status = "ERR "
                    print(f"  {total_i:>6} {status} {mid[:80]}")
                else:
                    batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
                    ok_count += 1
                    status = "OK  "
                    subject_str = (doc.get("subject") or "")[:60]
                    sender_str = (doc.get("sender", {}).get("email") or "")[:40]
                    print(f"  {total_i:>6} {status} {subject_str:<60} {sender_str}")
            if len(batch) >= BATCH_SIZE:
                flush()
            if total_i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = total_i / elapsed if elapsed > 0 else 0
                print(f"  {'─'*80}")
                print(f"  Progress: ok={ok_count} sync={sync_count} skip={skip_count} err={err_count} {rate:.1f} msg/s")
                print(f"  {'─'*80}")
        flush()
        print(f"  {folder_count} messages from folder {folder['path']}")
        if args.limit and total_i >= args.limit:
            break
    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Result: ok={ok_count} | sync={sync_count} | skip={skip_count} | err={err_count}")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Documents in collection: {col.count_documents({})}")
    if not args.no_indexes:
        print()
        create_indexes(col)
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
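
Both folder listing and message listing in the script above follow the same paging pattern: request a page, yield its `value` items, then follow `@odata.nextLink` until it is absent. The sketch below isolates that pattern so it can be exercised without network access. `iter_pages` and the injected `fetch` callable are illustrative names, not part of the script.

```python
from typing import Callable, Iterator, Optional


def iter_pages(
    first_url: str,
    fetch: Callable[[str, Optional[dict]], dict],
    params: Optional[dict] = None,
) -> Iterator[dict]:
    # Hypothetical helper: `fetch` plays the role of graph_get(url, params).
    url: Optional[str] = first_url
    while url:
        page = fetch(url, params)
        yield from page.get("value", [])
        url = page.get("@odata.nextLink")
        # The nextLink already embeds the query string, so the original
        # parameters are sent with the first request only.
        params = None


if __name__ == "__main__":
    fake_pages = {
        "page1": {"value": [{"id": 1}, {"id": 2}], "@odata.nextLink": "page2"},
        "page2": {"value": [{"id": 3}]},
    }
    for item in iter_pages("page1", lambda url, params: fake_pages[url], {"$top": 2}):
        print(item)
```

Injecting the fetch function keeps the paging logic testable with canned pages; in the script the same role is played by `graph_get`.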
+449
@@ -0,0 +1,449 @@
"""
download_attachments_v1.0.py
Name:    download_attachments_v1.0.py
Version: 1.0
Date:    2026-06-02
Author:  vladimir.buzalka
Description:
Downloads the real attachments (is_inline=False) of all emails in the MongoDB
collection ordinace@buzalkova.cz directly through the Microsoft Graph API and
stores them in the directory /mnt/Emails/ordinace@buzalkova.cz/Attachments/.
Deduplication by the SHA256 hash of the content:
- same hash = the file already exists -> skipped
- first occurrence of a file: saved under its original name
- name collision (same name, different hash): faktura_2.pdf, faktura_3.pdf ...
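The deduplication rules above can be sketched in isolation as follows. This is a simplified illustration, not the script's code: `store_deduplicated` is a hypothetical name, and a plain dict (hash -> file name) stands in for the attachments_index collection.

```python
import hashlib
import tempfile
from pathlib import Path


def store_deduplicated(content: bytes, name: str, directory: Path, index: dict[str, str]) -> tuple[str, bool]:
    # Returns (file name on disk, was_new).
    digest = hashlib.sha256(content).hexdigest()
    if digest in index:
        return index[digest], False          # same content already stored
    stem, suffix = Path(name).stem, Path(name).suffix
    taken = set(index.values())
    candidate, n = name, 2
    # Same name but different content: try name_2, name_3, ...
    while candidate in taken or (directory / candidate).exists():
        candidate = f"{stem}_{n}{suffix}"
        n += 1
    (directory / candidate).write_bytes(content)
    index[digest] = candidate
    return candidate, True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        idx: dict[str, str] = {}
        print(store_deduplicated(b"first", "faktura.pdf", Path(tmp), idx))
        print(store_deduplicated(b"first", "kopie.pdf", Path(tmp), idx))
        print(store_deduplicated(b"other", "faktura.pdf", Path(tmp), idx))
```

Keying the index by content hash means a renamed copy of the same file is recognized as a duplicate, while the suffix loop only runs when the name is taken by different content.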
After saving, MongoDB is updated:
- in the email document: every attachment gets file_hash + local_path
- collection emaily.attachments_index: _id=hash, filename, local_path, size_bytes,
  mime_type, first_seen_at, ref_count (number of emails that contain it)
Safe to interrupt and re-run:
- messages whose attachments are all downloaded already (have file_hash) are skipped
- --force-recheck re-checks already downloaded ones too (in case of changes on disk)
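The resume behaviour rests on one query: an email is still pending if at least one non-inline attachment has no file_hash yet. The sketch below shows that filter next to a pure-Python equivalent that can be unit-tested without MongoDB. `PENDING_QUERY` and `is_pending` are illustrative names; the script builds the filter inline.

```python
# Filter for emails that still have work to do (same shape as the script's query).
PENDING_QUERY = {
    "has_attachments": True,
    "attachments": {
        "$elemMatch": {"is_inline": False, "file_hash": {"$exists": False}},
    },
}


def is_pending(email_doc: dict) -> bool:
    # Hypothetical pure-Python mirror of PENDING_QUERY for tests.
    if email_doc.get("has_attachments") is not True:
        return False
    return any(
        att.get("is_inline") is False and "file_hash" not in att
        for att in email_doc.get("attachments") or []
    )


if __name__ == "__main__":
    done = {"has_attachments": True,
            "attachments": [{"is_inline": False, "file_hash": "abc"}]}
    todo = {"has_attachments": True,
            "attachments": [{"is_inline": False, "file_hash": "abc"},
                            {"is_inline": False}]}
    print(is_pending(done), is_pending(todo))
```

Using `$elemMatch` matters here: both conditions must hold for the same array element, otherwise an email with one inline image and one already downloaded file would be selected again.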
NOTE: The script only READS from the mailbox; it never writes to it!
Usage:
    python download_attachments_v1.0.py                  # download everything that is missing
    python download_attachments_v1.0.py --limit 50       # test on the first 50 emails
    python download_attachments_v1.0.py --force-recheck  # re-check already downloaded ones too
Docker (after adding the mount /mnt/user/Emails -> /mnt/Emails):
    docker exec -it python-runner python /scripts/download_attachments_v1.0.py
Dependencies:
    msal, requests, pymongo
    Python 3.10+
Layout on disk:
    /mnt/Emails/
    └── ordinace@buzalkova.cz/
        └── Attachments/
            ├── faktura_2026.pdf
            ├── vysledky_lab.pdf
            ├── vysledky_lab_2.pdf   <- name collision, different content
            └── ...
Collection emaily.attachments_index:
    _id            SHA256 hash (hex)
    filename       file name on disk (from the first occurrence)
    local_path     path relative to Attachments/ (for now = filename)
    size_bytes     file size
    mime_type      MIME type
    first_seen_at  datetime UTC
    ref_count      number of emails in which this attachment occurs
Updates to the email document (collection ordinace@buzalkova.cz):
    attachments[i].file_hash   SHA256 hash
    attachments[i].local_path  path relative to Attachments/
Version history:
    1.0 2026-06-02 Initial version
"""
import sys
import hashlib
import logging
import argparse
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from pymongo import MongoClient, UpdateOne
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL_EMAILS = "ordinace@buzalkova.cz"
MONGO_COL_INDEX = "attachments_index"
ATTACHMENTS_DIR = Path("/mnt/Emails/ordinace@buzalkova.cz/Attachments")
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.0"
BATCH_SIZE = 50
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
_graph_token: Optional[str] = None


# ─── Graph API ────────────────────────────────────────────────────────────────
def get_token() -> str:
    """Acquire an app-only Graph token (client-credentials flow) and cache it."""
    global _graph_token
    app = msal.ConfidentialClientApplication(
        GRAPH_CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
        client_credential=GRAPH_CLIENT_SECRET,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(f"Graph auth failed: {result}")
    _graph_token = result["access_token"]
    return _graph_token


def graph_get_bytes(url: str) -> bytes:
    """Download the binary content of an attachment."""
    global _graph_token
    if not _graph_token:
        get_token()
    for _attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, timeout=120, stream=True)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.content
    raise RuntimeError(f"Graph GET bytes failed: {url}")


def graph_get_json(url: str, params: Optional[dict] = None) -> dict:
    """GET a Graph URL and return the JSON body; on 401, refresh the token once and retry."""
    global _graph_token
    if not _graph_token:
        get_token()
    for _attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET json failed: {url}")


def fetch_attachment_content(graph_message_id: str, attachment_id: str) -> Optional[bytes]:
    """Download the attachment content through the Graph API; None on failure."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{graph_message_id}/attachments/{attachment_id}/$value"
    try:
        return graph_get_bytes(url)
    except Exception as e:
        logging.error("fetch_attachment_content failed [msg=%s att=%s]: %s", graph_message_id, attachment_id, e)
        return None


def fetch_message_attachments(graph_message_id: str) -> list[dict]:
    """Load the message's attachment list from the Graph API (metadata including attachment ID)."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{graph_message_id}/attachments"
    try:
        data = graph_get_json(url, {"$select": "id,name,contentType,size,isInline,contentId"})
        return data.get("value", [])
    except Exception as e:
        logging.error("fetch_message_attachments failed [%s]: %s", graph_message_id, e)
        return []
# ─── Dedup + saving ───────────────────────────────────────────────────────────
def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def resolve_filename(desired_name: str, att_dir: Path, hash_val: str, index_col) -> str:
    """
    Return the file name to use when saving.
    If desired_name already exists with a different hash, append a suffix _2, _3 ...
    """
    # Check whether an indexed file with the same name has the same hash
    existing = index_col.find_one({"filename": desired_name})
    if existing:
        if existing["_id"] == hash_val:
            return desired_name  # Same hash, same name: dedup hit
        # Different hash: look for a free suffix
        stem = Path(desired_name).stem
        suffix = Path(desired_name).suffix
        n = 2
        while True:
            candidate = f"{stem}_{n}{suffix}"
            if not (att_dir / candidate).exists():
                # Make sure the index does not hold this candidate with a different hash either
                ex2 = index_col.find_one({"filename": candidate})
                if not ex2 or ex2["_id"] == hash_val:
                    return candidate
            n += 1
    return desired_name


def save_attachment(content: bytes, original_name: str, att_dir: Path, index_col) -> tuple[str, str, bool]:
    """
    Save an attachment with deduplication.
    Returns (hash, local_path, was_new):
        was_new=True  -> the file was saved
        was_new=False -> the hash already existed, the file was skipped
    """
    hash_val = sha256(content)
    # Check the index: if the hash already exists, return the existing record
    existing = index_col.find_one({"_id": hash_val})
    if existing:
        # Increase the reference counter.
        # NOTE: this runs on every call, so a --force-recheck run counts the same email again.
        index_col.update_one({"_id": hash_val}, {"$inc": {"ref_count": 1}})
        return hash_val, existing["local_path"], False
    # New file: determine its name
    safe_name = "".join(c if c.isalnum() or c in "._- " else "_" for c in original_name).strip()
    if not safe_name:
        safe_name = f"attachment_{hash_val[:8]}"
    filename = resolve_filename(safe_name, att_dir, hash_val, index_col)
    file_path = att_dir / filename
    # Save the file
    file_path.write_bytes(content)
    # Record it in the index (mime_type is filled in by the caller)
    index_col.insert_one({
        "_id": hash_val,
        "filename": filename,
        "local_path": filename,
        "size_bytes": len(content),
        "mime_type": "",
        "first_seen_at": datetime.now(timezone.utc).replace(tzinfo=None),
        "ref_count": 1,
    })
    return hash_val, filename, True
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description=f"download_attachments v{SCRIPT_VERSION}")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N emails (0 = all)")
    ap.add_argument("--force-recheck", action="store_true",
                    help="Re-check emails whose attachments already have file_hash")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes on the attachment index collection")
    args = ap.parse_args()
    start = datetime.now()
    print(f"=== download_attachments v{SCRIPT_VERSION} ===")
    print(f"Start:            {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Mailbox:          {GRAPH_MAILBOX}")
    print(f"Target directory: {ATTACHMENTS_DIR}")
    print(f"MongoDB:          {MONGO_URI} -> {MONGO_DB}")
    # Directory
    ATTACHMENTS_DIR.mkdir(parents=True, exist_ok=True)
    print("  Directory OK")
    # Graph
    print("\nConnecting to Graph API...")
    try:
        get_token()
        print("  Graph API OK")
    except Exception as e:
        print(f"  ERROR: {e}")
        sys.exit(1)
    # MongoDB
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print("  MongoDB OK")
    except Exception as e:
        print(f"  ERROR: MongoDB is not reachable -- {e}")
        sys.exit(1)
    col_emails = client[MONGO_DB][MONGO_COL_EMAILS]
    col_index = client[MONGO_DB][MONGO_COL_INDEX]
    # Indexes on the attachment index collection
    if not args.no_indexes:
        col_index.create_index("filename")
        col_index.create_index("mime_type")
    # Query: emails with an attachment that has not been processed yet
    if args.force_recheck:
        query = {"has_attachments": True}
    else:
        query = {
            "has_attachments": True,
            "attachments": {
                "$elemMatch": {
                    "is_inline": False,
                    "file_hash": {"$exists": False},
                }
            }
        }
    total = col_emails.count_documents(query)
    print(f"\nEmails to process: {total}")
    if total == 0:
        print("Nothing to download.")
        client.close()
        return
    cursor = col_emails.find(query, {"_id": 1, "graph_id": 1, "subject": 1, "attachments": 1})
    if args.limit:
        cursor = cursor.limit(args.limit)
    ok_count = 0
    new_count = 0
    skip_count = 0
    err_count = 0
    email_i = 0
    batch = []

    def flush():
        """Write the pending email updates to MongoDB and empty the batch."""
        if not batch:
            return
        try:
            col_emails.bulk_write(batch, ordered=False)
        except Exception as e:
            logging.error("bulk_write: %s", e)
            print(f"  ERROR bulk_write: {e}")
        batch.clear()

    for email_doc in cursor:
        email_i += 1
        email_id = email_doc["_id"]
        graph_id = email_doc.get("graph_id", "")
        subject = (email_doc.get("subject") or "")[:60]
        att_list = email_doc.get("attachments") or []
        # Real attachments only
        real_atts = [a for a in att_list if not a.get("is_inline", False)]
        if not real_atts:
            continue
        print(f"\n  {email_i:>5}/{total} {subject}")
        # Load the attachment IDs from the Graph API
        graph_atts = fetch_message_attachments(graph_id)
        graph_att_map = {a["name"]: a for a in graph_atts if not a.get("isInline", False)}
        updated_atts = list(att_list)
        email_ok = True
        for i, att in enumerate(updated_atts):
            if att.get("is_inline", False):
                continue
            if not args.force_recheck and att.get("file_hash"):
                skip_count += 1
                print(f"    SKIP {att['filename']}")
                continue
            att_name = att.get("filename", "")
            graph_att = graph_att_map.get(att_name)
            if not graph_att:
                # Try to find it by a partial name match
                for gname, ga in graph_att_map.items():
                    if att_name.lower() in gname.lower():
                        graph_att = ga
                        break
            if not graph_att:
                logging.error("attachment not found in Graph [email=%s att=%s]", email_id, att_name)
                print(f"    ERR  {att_name} (not found in Graph)")
                err_count += 1
                email_ok = False
                continue
            # Download the content
            content = fetch_attachment_content(graph_id, graph_att["id"])
            if content is None:
                err_count += 1
                email_ok = False
                print(f"    ERR  {att_name} (download failed)")
                continue
            # Save with dedup
            hash_val, local_path, was_new = save_attachment(content, att_name, ATTACHMENTS_DIR, col_index)
            # Update the MIME type in the index
            col_index.update_one(
                {"_id": hash_val},
                {"$set": {"mime_type": att.get("mime_type", graph_att.get("contentType", ""))}},
            )
            # Record it in the email
            updated_atts[i] = {**att, "file_hash": hash_val, "local_path": local_path}
            if was_new:
                new_count += 1
                print(f"    NEW  {local_path} ({len(content):,} B)")
            else:
                skip_count += 1
                print(f"    DUP  {att_name} -> {local_path}")
        if email_ok:
            ok_count += 1
        # Write the updated attachments back to the email
        batch.append(UpdateOne(
            {"_id": email_id},
            {"$set": {"attachments": updated_atts}}
        ))
        if len(batch) >= BATCH_SIZE:
            flush()
        if email_i % 100 == 0:
            print(f"  {'─'*60}")
            print(f"  Progress: emails={email_i}/{total} new={new_count} dup={skip_count} err={err_count}")
            print(f"  {'─'*60}")
    flush()
    elapsed_total = (datetime.now() - start).total_seconds()
    files_total = col_index.count_documents({})
    size_total = sum(d.get("size_bytes", 0) for d in col_index.find({}, {"size_bytes": 1}))
    print(f"\n{'='*52}")
    print(f"Result: emails={ok_count} | new files={new_count} | duplicates={skip_count} | err={err_count}")
    print(f"Files in index: {files_total} ({size_total/1024/1024:.1f} MB)")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
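
The downloader above matches attachments by name against a fresh Graph listing, although parse_emails_graph v1.2 already stores `graph_att_id` on each attachment. A minimal sketch of the direct route follows, assuming the stored IDs are still valid for the message (for example, that it has not been moved since the import). `attachment_value_url` and `pending_downloads` are hypothetical helpers, not functions from either script.

```python
from typing import Iterator

GRAPH_URL = "https://graph.microsoft.com/v1.0"


def attachment_value_url(mailbox: str, graph_message_id: str, graph_att_id: str) -> str:
    # Same URL shape as fetch_attachment_content builds in the script.
    return f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments/{graph_att_id}/$value"


def pending_downloads(mailbox: str, email_doc: dict) -> Iterator[tuple[str, str]]:
    # Yield (filename, download URL) for every real attachment not yet downloaded.
    for att in email_doc.get("attachments") or []:
        if att.get("is_inline", False) or att.get("file_hash"):
            continue
        att_id = att.get("graph_att_id")
        if not att_id:
            continue  # imported before v1.2: would need the name-matching fallback
        yield att.get("filename", ""), attachment_value_url(mailbox, email_doc["graph_id"], att_id)


if __name__ == "__main__":
    doc = {"graph_id": "MSG1", "attachments": [
        {"filename": "a.pdf", "is_inline": False, "graph_att_id": "ATT1"},
        {"filename": "logo.png", "is_inline": True, "graph_att_id": "ATT2"},
    ]}
    for name, url in pending_downloads("box@example.com", doc):
        print(name, url)
```

This would save one listing request per email and sidestep the case where two attachments in one message share a name, at the cost of depending on IDs captured at import time.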
+428
@@ -0,0 +1,428 @@
"""
download_attachments_v1.1.py
Name:    download_attachments_v1.1.py
Version: 1.1
Date:    2026-06-02
Author:  vladimir.buzalka
Description:
Downloads the real attachments (is_inline=False) of all emails in MongoDB
through the Microsoft Graph API and stores them in the directory
/mnt/Emails/<mailbox>/Attachments/.
The mailbox is passed as the required parameter --mailbox.
Deduplication by the SHA256 hash of the content:
- same hash = the file already exists -> skipped
- first occurrence of a file: saved under its original name
- name collision (same name, different hash): faktura_2.pdf, faktura_3.pdf ...
After saving, MongoDB is updated:
- in the email document: every attachment gets file_hash + local_path
- collection emaily.attachments_index: _id=hash, filename, local_path, size_bytes,
  mime_type, mailbox, first_seen_at, ref_count
Safe to interrupt and re-run: emails whose attachments all have file_hash
are skipped. --force-recheck re-checks already downloaded ones too.
NOTE: The script only READS from the mailbox; it never writes to it!
Usage:
    python download_attachments_v1.1.py --mailbox ordinace@buzalkova.cz
    python download_attachments_v1.1.py --mailbox vladimir.buzalka@buzalka.cz --limit 50
    python download_attachments_v1.1.py --mailbox ordinace@buzalkova.cz --force-recheck
Docker:
    docker exec -it python-runner python /scripts/download_attachments_v1.1.py \\
        --mailbox ordinace@buzalkova.cz
Dependencies:
    msal, requests, pymongo
    Python 3.10+
Layout on disk:
    /mnt/Emails/
    └── <mailbox>/
        └── Attachments/
            ├── faktura_2026.pdf
            ├── vysledky_lab.pdf
            ├── vysledky_lab_2.pdf
            └── ...
Collection emaily.attachments_index:
    _id            SHA256 hash (hex)
    filename       file name on disk
    local_path     path relative to Attachments/
    size_bytes     file size
    mime_type      MIME type
    mailbox        mailbox in which the attachment first occurred
    first_seen_at  datetime UTC
    ref_count      number of emails in which this attachment occurs
Version history:
    1.0 2026-06-02 Initial version
    1.1 2026-06-02 Mailbox as the --mailbox parameter (general-purpose use)
"""
import sys
import hashlib
import logging
import argparse
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from pymongo import MongoClient, UpdateOne
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL_INDEX = "attachments_index"
EMAILS_BASE_DIR = Path("/mnt/Emails")
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.1"
BATCH_SIZE = 50
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
_graph_token: Optional[str] = None
# ─── Graph API ────────────────────────────────────────────────────────────────
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get_bytes(url: str) -> bytes:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, timeout=120, stream=True)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.content
raise RuntimeError(f"Graph GET bytes failed: {url}")
def graph_get_json(url: str, params: dict = None) -> dict:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.json()
raise RuntimeError(f"Graph GET json failed: {url}")
def fetch_message_attachments(mailbox: str, graph_message_id: str) -> list[dict]:
    url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments"
    try:
        # contentId is not a property of the base attachment type; selecting it
        # makes Graph answer BadRequest, which the except below turns into [].
        data = graph_get_json(url, {"$select": "id,name,contentType,size,isInline"})
        return data.get("value", [])
    except Exception as e:
        logging.error("fetch_message_attachments failed [%s]: %s", graph_message_id, e)
        return []
def fetch_attachment_content(mailbox: str, graph_message_id: str, attachment_id: str) -> Optional[bytes]:
url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments/{attachment_id}/$value"
try:
return graph_get_bytes(url)
except Exception as e:
logging.error("fetch_attachment_content failed [msg=%s att=%s]: %s", graph_message_id, attachment_id, e)
return None
# ─── Dedup + saving ───────────────────────────────────────────────────────────
def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()
def safe_filename(name: str) -> str:
    safe = "".join(c if c.isalnum() or c in "._- " else "_" for c in name).strip()
    return safe or "attachment"
def resolve_filename(desired_name: str, att_dir: Path, hash_val: str, col_index) -> str:
    """Return the file name to save under; resolves collisions (same name, different hash)."""
    existing = col_index.find_one({"filename": desired_name})
    if existing:
        if existing["_id"] == hash_val:
            return desired_name  # Dedup hit: same hash
        # Collision: look for a free numeric suffix
        stem = Path(desired_name).stem
        suffix = Path(desired_name).suffix
        n = 2
        while True:
            candidate = f"{stem}_{n}{suffix}"
            ex2 = col_index.find_one({"filename": candidate})
            if not ex2 or ex2["_id"] == hash_val:
                if not (att_dir / candidate).exists() or (ex2 and ex2["_id"] == hash_val):
                    return candidate
            n += 1
    return desired_name
def save_attachment(
content: bytes,
original_name: str,
mime_type: str,
mailbox: str,
att_dir: Path,
col_index,
) -> tuple[str, str, bool]:
"""
    Save the attachment with deduplication.
    Returns (hash, local_path, was_new).
"""
hash_val = sha256(content)
existing = col_index.find_one({"_id": hash_val})
if existing:
col_index.update_one({"_id": hash_val}, {"$inc": {"ref_count": 1}})
return hash_val, existing["local_path"], False
filename = resolve_filename(safe_filename(original_name), att_dir, hash_val, col_index)
file_path = att_dir / filename
file_path.write_bytes(content)
col_index.insert_one({
"_id": hash_val,
"filename": filename,
"local_path": filename,
"size_bytes": len(content),
"mime_type": mime_type,
"mailbox": mailbox,
"first_seen_at": datetime.now(timezone.utc).replace(tzinfo=None),
"ref_count": 1,
})
return hash_val, filename, True
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"download_attachments v{SCRIPT_VERSION}")
ap.add_argument("--mailbox", required=True,
help="Emailova schranka (napr. ordinace@buzalkova.cz)")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N emailu (0 = vse)")
ap.add_argument("--force-recheck", action="store_true",
help="Znovu overi i emaily kde prilohy uz maji file_hash")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na attachments_index kolekci")
args = ap.parse_args()
mailbox = args.mailbox
att_dir = EMAILS_BASE_DIR / mailbox / "Attachments"
mongo_col = mailbox
start = datetime.now()
print(f"=== download_attachments v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {mailbox}")
print(f"Cilovy adresar: {att_dir}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{mongo_col}")
att_dir.mkdir(parents=True, exist_ok=True)
print(" Adresar OK")
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col_emails = client[MONGO_DB][mongo_col]
col_index = client[MONGO_DB][MONGO_COL_INDEX]
if not args.no_indexes:
col_index.create_index("filename")
col_index.create_index("mime_type")
col_index.create_index("mailbox")
    # Query
if args.force_recheck:
query = {"has_attachments": True}
else:
query = {
"has_attachments": True,
"attachments": {
"$elemMatch": {
"is_inline": False,
"file_hash": {"$exists": False},
}
}
}
total = col_emails.count_documents(query)
print(f"\nEmailu ke zpracovani: {total}")
if total == 0:
print("Neni co stahnout.")
client.close()
return
cursor = col_emails.find(query, {"_id": 1, "graph_id": 1, "subject": 1, "attachments": 1})
if args.limit:
cursor = cursor.limit(args.limit)
ok_count = 0
new_count = 0
dup_count = 0
err_count = 0
email_i = 0
batch = []
def flush():
if not batch:
return
try:
col_emails.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
for email_doc in cursor:
email_i += 1
email_id = email_doc["_id"]
graph_id = email_doc.get("graph_id", "")
subject = (email_doc.get("subject") or "")[:60]
att_list = email_doc.get("attachments") or []
real_atts = [a for a in att_list if not a.get("is_inline", False)]
if not real_atts:
continue
print(f"\n {email_i:>5}/{total} {subject}")
graph_atts = fetch_message_attachments(mailbox, graph_id)
graph_att_map = {a["name"]: a for a in graph_atts if not a.get("isInline", False)}
updated_atts = list(att_list)
email_ok = True
for i, att in enumerate(updated_atts):
if att.get("is_inline", False):
continue
if not args.force_recheck and att.get("file_hash"):
print(f" SKIP {att['filename']}")
continue
att_name = att.get("filename", "")
graph_att = graph_att_map.get(att_name)
if not graph_att:
for gname, ga in graph_att_map.items():
if att_name.lower() in gname.lower():
graph_att = ga
break
if not graph_att:
logging.error("attachment not found in Graph [email=%s att=%s]", email_id, att_name)
print(f" ERR {att_name} (nenalezeno v Graph)")
err_count += 1
email_ok = False
continue
content = fetch_attachment_content(mailbox, graph_id, graph_att["id"])
if content is None:
err_count += 1
email_ok = False
print(f" ERR {att_name} (stazeni selhalo)")
continue
mime_type = att.get("mime_type") or graph_att.get("contentType", "")
hash_val, local_path, was_new = save_attachment(
content, att_name, mime_type, mailbox, att_dir, col_index
)
updated_atts[i] = {**att, "file_hash": hash_val, "local_path": local_path}
if was_new:
new_count += 1
print(f" NEW {local_path} ({len(content):,} B)")
else:
dup_count += 1
print(f" DUP {att_name} -> {local_path}")
if email_ok:
ok_count += 1
batch.append(UpdateOne({"_id": email_id}, {"$set": {"attachments": updated_atts}}))
if len(batch) >= BATCH_SIZE:
flush()
        if email_i % 100 == 0:
            elapsed = (datetime.now() - start).total_seconds()
            # Separator line; the fill character "-" is an assumption
            print(f" {'-'*60}")
            print(f" Průběh: emaily={email_i}/{total} nove={new_count} dup={dup_count} err={err_count}")
            print(f" {'-'*60}")
flush()
elapsed_total = (datetime.now() - start).total_seconds()
files_total = col_index.count_documents({})
size_total = sum(d.get("size_bytes", 0) for d in col_index.find({}, {"size_bytes": 1}))
print(f"\n{'='*52}")
print(f"Vysledek: emaily={ok_count} | nove={new_count} | dup={dup_count} | err={err_count}")
print(f"Souboru v indexu: {files_total} ({size_total / 1024 / 1024:.1f} MB)")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()
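The name-collision rule used by `resolve_filename` above (same name but different content gets a `_2`, `_3`, ... suffix) can be tried without MongoDB. This is a simplified sketch under that assumption: `pick_name` is a hypothetical helper, and a plain dict mapping file name to SHA-256 stands in for the `attachments_index` collection.

```python
import hashlib
from pathlib import Path


def pick_name(desired: str, content_hash: str, names: dict[str, str]) -> str:
    # names maps an already used file name to the hash of its content.
    if names.get(desired, content_hash) == content_hash:
        return desired  # free name, or the same content under the same name
    stem, suffix = Path(desired).stem, Path(desired).suffix
    n = 2
    while True:
        candidate = f"{stem}_{n}{suffix}"
        if names.get(candidate, content_hash) == content_hash:
            return candidate
        n += 1


names: dict[str, str] = {}
results = []
# Four attachments that all arrive as "faktura.pdf"; the last repeats the second.
for payload in (b"invoice A", b"invoice B", b"invoice C", b"invoice B"):
    digest = hashlib.sha256(payload).hexdigest()
    name = pick_name("faktura.pdf", digest, names)
    names[name] = digest
    results.append(name)
```

Unlike the script, the sketch does not look at the file system, so it shows only the index side of the decision.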
+483
@@ -0,0 +1,483 @@
"""
download_attachments_v1.3.py
Name: download_attachments_v1.3.py
Version: 1.3
Date: 2026-06-02
Author: vladimir.buzalka
Description:
    Downloads the real attachments (is_inline=False) of all emails in MongoDB
    through the Microsoft Graph API and stores them in the directory
    /mnt/Emails/<mailbox>/Attachments/.
    The mailbox is passed as the required --mailbox parameter.
    Deduplication by the SHA256 hash of the content:
    - same hash = the file already exists -> skipped
    - first occurrence of a file: saved under its original name
    - name collision (same name, different hash): faktura_2.pdf, faktura_3.pdf ...
    After saving, MongoDB is updated:
    - in the email document: each attachment gets file_hash + local_path
    - collection emaily.attachments_index: _id=hash, filename, local_path, size_bytes,
      mime_type, mailbox, first_seen_at, ref_count
    Safe to interrupt and re-run: emails whose attachments all have file_hash
    are skipped. --force-recheck re-verifies already downloaded ones as well.
NOTE: The script only READS from the mailbox; it never writes to it!
Usage:
    python download_attachments_v1.3.py --mailbox ordinace@buzalkova.cz
    python download_attachments_v1.3.py --mailbox ordinace@buzalkova.cz --limit 50
    python download_attachments_v1.3.py --mailbox ordinace@buzalkova.cz --force-recheck
Docker:
    docker exec -it python-runner python /scripts/download_attachments_v1.3.py \\
        --mailbox ordinace@buzalkova.cz
Dependencies:
    msal, requests, pymongo
    Python 3.10+
Version history:
    1.0 2026-06-02 Initial version
    1.1 2026-06-02 Mailbox passed as the --mailbox parameter
    1.2 2026-06-02 Fix: the Graph attachment map now includes inline attachments;
                   name normalization; S/MIME is skipped; inline according to
                   Graph -> SKIP, not ERR
    1.3 2026-06-02 Primary download via graph_att_id (direct ID, no name matching);
                   fixed $select on the attachment list (removed contentId, which
                   caused BadRequest and returned an empty list); name matching
                   stays as a fallback for old emails without graph_att_id
"""
import os
import sys
import re
import hashlib
import logging
import argparse
import unicodedata
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from pymongo import MongoClient, UpdateOne
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
# Assumption: the client secret is supplied through the GRAPH_CLIENT_SECRET
# environment variable instead of being committed together with the script.
GRAPH_CLIENT_SECRET = os.environ["GRAPH_CLIENT_SECRET"]
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL_INDEX = "attachments_index"
EMAILS_BASE_DIR = Path("/mnt/Emails")
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.3"
BATCH_SIZE = 50
# Attachment types to skip (S/MIME signatures, certificates)
SKIP_EXTENSIONS = {".p7m", ".p7s", ".p7c", ".p7b"}
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
_graph_token: Optional[str] = None
# ─── Graph API ────────────────────────────────────────────────────────────────
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get_bytes(url: str) -> bytes:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, timeout=120, stream=True)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.content
raise RuntimeError(f"Graph GET bytes failed: {url}")
def graph_get_json(url: str, params: dict = None) -> dict:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.json()
raise RuntimeError(f"Graph GET json failed: {url}")
def fetch_message_attachments(mailbox: str, graph_message_id: str) -> list[dict]:
    """Load the metadata of all attachments of a message (without contentBytes)."""
    url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments"
    try:
        # Note: contentId is NOT part of the base attachment type, so it must not appear in $select
        data = graph_get_json(url, {"$select": "id,name,contentType,size,isInline"})
        return data.get("value", [])
    except Exception as e:
        logging.error("fetch_message_attachments failed [%s]: %s", graph_message_id, e)
        return []
def fetch_attachment_content(mailbox: str, graph_message_id: str, attachment_id: str) -> Optional[bytes]:
url = f"{GRAPH_URL}/users/{mailbox}/messages/{graph_message_id}/attachments/{attachment_id}/$value"
try:
return graph_get_bytes(url)
except Exception as e:
logging.error("fetch_attachment_content failed [msg=%s att=%s]: %s",
graph_message_id, attachment_id, e)
return None
# ─── Helper functions ─────────────────────────────────────────────────────────
def normalize_name(name: str) -> str:
    """Normalize a name for comparison: lowercase, no diacritics, only word characters plus . and -"""
    nfkd = unicodedata.normalize("NFKD", name.lower().strip())
    ascii_str = "".join(c for c in nfkd if not unicodedata.combining(c))
    return re.sub(r"[^\w.\-]", "_", ascii_str)
def find_graph_att(att_name: str, att_size: int, graph_atts: list[dict]) -> Optional[dict]:
    """Fallback: find the attachment in the Graph list by name (for emails without graph_att_id)."""
    # 1. Exact match
    for ga in graph_atts:
        if ga.get("name") == att_name:
            return ga
    norm_want = normalize_name(att_name)
    norm_matches = [ga for ga in graph_atts if normalize_name(ga.get("name") or "") == norm_want]
    # 2. Normalized match + size (±10 %). This is tried before the plain
    #    normalized match; in the reverse order it could never be reached.
    for ga in norm_matches:
        ga_size = ga.get("size", 0)
        if att_size and ga_size and abs(ga_size - att_size) / max(ga_size, att_size) < 0.1:
            return ga
    # 3. Normalized match regardless of size
    if norm_matches:
        return norm_matches[0]
    # 4. Partial suffix match (last 20 characters of the normalized name)
    tail = norm_want[-20:]
    for ga in graph_atts:
        if tail and normalize_name(ga.get("name") or "").endswith(tail):
            return ga
    return None
def sha256(data: bytes) -> str:
return hashlib.sha256(data).hexdigest()
def safe_filename(name: str) -> str:
safe = "".join(c if c.isalnum() or c in "._- ()" else "_" for c in name).strip()
return safe or "attachment"
def resolve_filename(desired_name: str, att_dir: Path, hash_val: str, col_index) -> str:
existing = col_index.find_one({"filename": desired_name})
if existing:
if existing["_id"] == hash_val:
return desired_name
stem = Path(desired_name).stem
suffix = Path(desired_name).suffix
n = 2
while True:
candidate = f"{stem}_{n}{suffix}"
ex2 = col_index.find_one({"filename": candidate})
if not ex2 or ex2["_id"] == hash_val:
if not (att_dir / candidate).exists() or (ex2 and ex2["_id"] == hash_val):
return candidate
n += 1
return desired_name
def save_attachment(
content: bytes,
original_name: str,
mime_type: str,
mailbox: str,
att_dir: Path,
col_index,
) -> tuple[str, str, bool]:
hash_val = sha256(content)
existing = col_index.find_one({"_id": hash_val})
if existing:
col_index.update_one({"_id": hash_val}, {"$inc": {"ref_count": 1}})
return hash_val, existing["local_path"], False
filename = resolve_filename(safe_filename(original_name), att_dir, hash_val, col_index)
file_path = att_dir / filename
file_path.write_bytes(content)
col_index.insert_one({
"_id": hash_val,
"filename": filename,
"local_path": filename,
"size_bytes": len(content),
"mime_type": mime_type,
"mailbox": mailbox,
"first_seen_at": datetime.now(timezone.utc).replace(tzinfo=None),
"ref_count": 1,
})
return hash_val, filename, True
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
ap = argparse.ArgumentParser(description=f"download_attachments v{SCRIPT_VERSION}")
ap.add_argument("--mailbox", required=True,
help="Emailova schranka (napr. ordinace@buzalkova.cz)")
ap.add_argument("--limit", type=int, default=0,
help="Zpracovat max N emailu (0 = vse)")
ap.add_argument("--force-recheck", action="store_true",
help="Znovu overi i emaily kde prilohy uz maji file_hash")
ap.add_argument("--no-indexes", action="store_true",
help="Nevytvorit indexy na attachments_index kolekci")
args = ap.parse_args()
mailbox = args.mailbox
att_dir = EMAILS_BASE_DIR / mailbox / "Attachments"
mongo_col = mailbox
start = datetime.now()
print(f"=== download_attachments v{SCRIPT_VERSION} ===")
print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
print(f"Schránka: {mailbox}")
print(f"Cilovy adresar: {att_dir}")
print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{mongo_col}")
att_dir.mkdir(parents=True, exist_ok=True)
print(" Adresar OK")
print("\nPřipojuji se k Graph API...")
try:
get_token()
print(" Graph API OK")
except Exception as e:
print(f" CHYBA: {e}")
sys.exit(1)
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
try:
client.admin.command("ping")
print(" MongoDB OK")
except Exception as e:
print(f" CHYBA: MongoDB neni dostupna -- {e}")
sys.exit(1)
col_emails = client[MONGO_DB][mongo_col]
col_index = client[MONGO_DB][MONGO_COL_INDEX]
if not args.no_indexes:
col_index.create_index("filename")
col_index.create_index("mime_type")
col_index.create_index("mailbox")
if args.force_recheck:
query = {"has_attachments": True}
else:
query = {
"has_attachments": True,
"attachments": {
"$elemMatch": {
"is_inline": False,
"file_hash": {"$exists": False},
}
}
}
total = col_emails.count_documents(query)
print(f"\nEmailu ke zpracovani: {total}")
if total == 0:
print("Neni co stahnout.")
client.close()
return
cursor = col_emails.find(query, {"_id": 1, "graph_id": 1, "subject": 1, "attachments": 1})
if args.limit:
cursor = cursor.limit(args.limit)
ok_count = 0
new_count = 0
dup_count = 0
skip_count = 0
err_count = 0
email_i = 0
batch = []
def flush():
if not batch:
return
try:
col_emails.bulk_write(batch, ordered=False)
except Exception as e:
logging.error("bulk_write: %s", e)
print(f" CHYBA bulk_write: {e}")
batch.clear()
for email_doc in cursor:
email_i += 1
email_id = email_doc["_id"]
graph_id = email_doc.get("graph_id", "")
subject = (email_doc.get("subject") or "")[:60]
att_list = email_doc.get("attachments") or []
real_atts = [a for a in att_list if not a.get("is_inline", False)]
if not real_atts:
continue
print(f"\n {email_i:>5}/{total} {subject}")
        # Fetch the attachment list from Graph only if some attachments lack graph_att_id
need_listing = any(
not a.get("is_inline", False)
and not (not args.force_recheck and a.get("file_hash"))
and not a.get("graph_att_id")
for a in att_list
)
graph_atts = fetch_message_attachments(mailbox, graph_id) if need_listing else []
updated_atts = list(att_list)
email_ok = True
for i, att in enumerate(updated_atts):
if att.get("is_inline", False):
continue
if not args.force_recheck and att.get("file_hash"):
continue
att_name = att.get("filename", "")
att_size = att.get("size_bytes", 0)
graph_att_id = att.get("graph_att_id")
            # Skip S/MIME signatures
if Path(att_name).suffix.lower() in SKIP_EXTENSIONS:
updated_atts[i] = {**att, "file_hash": "skip", "local_path": ""}
skip_count += 1
print(f" SKIP {att_name} (S/MIME)")
continue
            # Direct access via graph_att_id (emails parsed by v1.2+)
if graph_att_id:
content = fetch_attachment_content(mailbox, graph_id, graph_att_id)
if content is None:
err_count += 1
email_ok = False
print(f" ERR {att_name} (stazeni selhalo)")
continue
                # No inline check happens on this path; the MIME type comes from the stored metadata
mime_type = att.get("mime_type", "")
else:
                # Fallback: name matching for old emails (parsed before v1.2)
graph_att = find_graph_att(att_name, att_size, graph_atts)
if not graph_att:
logging.error("attachment not found [email=%s att=%s]", email_id, att_name)
print(f" ERR {att_name} (nenalezeno)")
err_count += 1
email_ok = False
continue
                # If Graph reports the attachment as inline, skip it
if graph_att.get("isInline", False):
updated_atts[i] = {**att, "is_inline": True, "file_hash": "skip", "local_path": ""}
skip_count += 1
print(f" SKIP {att_name} (inline obrazek)")
continue
content = fetch_attachment_content(mailbox, graph_id, graph_att["id"])
if content is None:
err_count += 1
email_ok = False
print(f" ERR {att_name} (stazeni selhalo)")
continue
mime_type = att.get("mime_type") or graph_att.get("contentType", "")
hash_val, local_path, was_new = save_attachment(
content, att_name, mime_type, mailbox, att_dir, col_index
)
updated_atts[i] = {**att, "file_hash": hash_val, "local_path": local_path}
if was_new:
new_count += 1
print(f" NEW {local_path} ({len(content):,} B)")
else:
dup_count += 1
print(f" DUP {att_name} -> {local_path}")
if email_ok:
ok_count += 1
batch.append(UpdateOne({"_id": email_id}, {"$set": {"attachments": updated_atts}}))
if len(batch) >= BATCH_SIZE:
flush()
        if email_i % 100 == 0:
            elapsed = (datetime.now() - start).total_seconds()
            # Separator line; the fill character "-" is an assumption
            print(f" {'-'*60}")
            print(f" Průběh: emaily={email_i}/{total} nove={new_count} dup={dup_count} skip={skip_count} err={err_count}")
            print(f" {'-'*60}")
flush()
elapsed_total = (datetime.now() - start).total_seconds()
files_total = col_index.count_documents({})
size_total = sum(d.get("size_bytes", 0) for d in col_index.find({}, {"size_bytes": 1}))
print(f"\n{'='*52}")
print(f"Vysledek: emaily={ok_count} | nove={new_count} | dup={dup_count} | skip={skip_count} | err={err_count}")
print(f"Souboru v indexu: {files_total} ({size_total / 1024 / 1024:.1f} MB)")
print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
if err_count:
print(f"Chyby logovany do: {LOG_FILE}")
client.close()
if __name__ == "__main__":
main()
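The fallback matching in `find_graph_att` above depends on two names comparing equal after normalization. A small standalone sketch of that normalization, mirroring `normalize_name`; the two sample file names are invented for illustration.

```python
import re
import unicodedata


def normalize_name(name: str) -> str:
    # Lowercase, decompose accented characters (NFKD), drop the combining
    # marks, then replace every character that is not a word character,
    # a dot or a hyphen with "_".
    nfkd = unicodedata.normalize("NFKD", name.lower().strip())
    ascii_str = "".join(c for c in nfkd if not unicodedata.combining(c))
    return re.sub(r"[^\w.\-]", "_", ascii_str)


stored = "Výsledky laboratoře.PDF"       # name saved in MongoDB (hypothetical)
from_graph = "vysledky laboratore.pdf"   # name returned by Graph (hypothetical)
same = normalize_name(stored) == normalize_name(from_graph)
```

Both names reduce to the same key, so a stored name with diacritics still finds its Graph counterpart when the exact comparison fails.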
+560
@@ -0,0 +1,560 @@
"""
parse_emails_graph_v1.0.py
Name: parse_emails_graph_v1.0.py
Version: 1.0
Date: 2026-06-02
Author: vladimir.buzalka
Description:
    Reads all emails from the mailbox ordinace@buzalkova.cz directly through
    the Microsoft Graph API and imports them as documents into MongoDB.
    Extracts every available property from each message:
    - subject, sender, recipients (To/CC/BCC with types)
    - received, sent, created and modified times (UTC)
    - HTML body (max 2 MB) + text preview
    - attachments (metadata: name, size, MIME type, inline flag)
    - internet headers (SPF, DKIM, Received, X-*, ...)
    - MAPI equivalents: importance, flag, conversation thread,
      categories, In-Reply-To, References, ...
    - additionally: isRead, isDraft, folder_path, inferenceClassification
    Walks ALL mailbox folders recursively (Inbox, Sent, Deleted,
    archive folders, ...).
DB: emaily
Collection: ordinace@buzalkova.cz
_id: Internet Message-ID (or "graphid:<id>" as a fallback)
Safe to interrupt and re-run:
    - upsert by _id: duplicates are overwritten automatically
    - --skip-existing loads the list of finished _id values from MongoDB and skips them
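A sketch of why a repeated run is safe, with a dict standing in for the MongoDB collection. import_messages and the sample messages are hypothetical; the script itself writes through pymongo upserts.

```python
def message_key(msg: dict) -> str:
    # Internet Message-ID when present, otherwise the "graphid:<id>" fallback.
    mid = (msg.get("internetMessageId") or "").strip()
    return mid or f"graphid:{msg['id']}"


def import_messages(messages: list[dict], store: dict, skip_existing: bool = False) -> int:
    # Writes each message under its key and returns how many writes happened.
    done = set(store) if skip_existing else set()
    written = 0
    for msg in messages:
        key = message_key(msg)
        if key in done:
            continue  # already imported in an earlier run
        store[key] = {"_id": key, "subject": msg.get("subject") or ""}  # upsert
        written += 1
    return written


store: dict = {}
batch = [
    {"id": "AAA", "internetMessageId": "<1@example.org>", "subject": "Hello"},
    {"id": "BBB", "internetMessageId": None, "subject": "No Message-ID"},
]
first_run = import_messages(batch, store)
second_run = import_messages(batch, store)                     # overwrites, no duplicates
third_run = import_messages(batch, store, skip_existing=True)  # nothing left to do
```

Keying every document by a stable _id is what makes the second run harmless: it rewrites the same two entries instead of adding new ones.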
NOTE: The script only READS from the mailbox; it never writes to it!
Usage:
    python parse_emails_graph_v1.0.py                   # full import
    python parse_emails_graph_v1.0.py --limit 50        # test on the first 50
    python parse_emails_graph_v1.0.py --skip-existing   # resume after an interruption
    python parse_emails_graph_v1.0.py --folder Inbox    # a single folder only
    python parse_emails_graph_v1.0.py --no-indexes      # without creating indexes at the end
Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+
Structure of a document in MongoDB:
    _id                       Internet Message-ID (or graphid: fallback)
    graph_id                  Graph API message ID (for possible further operations)
    subject                   message subject
    normalized_subject        subject without RE:/FW:/AW: prefixes
    importance                0=low 1=normal 2=high
    flag_status               0=not flagged 1=flagged 2=complete
    is_read                   bool: current read state in the mailbox
    is_draft                  bool
    has_attachments           bool
    attachment_count          int
    inference_classification  focused / other (Outlook AI sorting)
    categories                [str]
    conversation_id           Graph conversationId
    conversation_index        base64 conversationIndex
    conversation_topic        thread topic (from the Thread-Topic internet header)
    in_reply_to               Message-ID of the previous message
    internet_references       [Message-ID]: full history of the thread
    received_at               datetime UTC
    sent_at                   datetime UTC
    created_at                datetime UTC: time the record was created in M365
    modified_at               datetime UTC: time of the last modification
    folder_id                 Graph parentFolderId
    folder_path               full folder path (e.g. Inbox/Subfolder)
    sender.email              email address of the sender
    sender.name               display name of the sender
    to                        To string (joined)
    cc                        CC string
    bcc                       BCC string
    recipients                [{type, email, name}]: to/cc/bcc with types
    body_html                 HTML body (max 2 MB)
    body_preview              text preview (max 255 characters from Graph)
    attachments               [{filename, size_bytes, mime_type,
                              content_id, is_inline}]
    headers                   dict of internet headers (lowercase_with_underscores)
    parsed_at                 datetime UTC: time of parsing
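How the headers field could look for a message with repeated headers: a minimal sketch, assuming the headers arrive as a list of name/value pairs. normalize_headers is an illustrative name and the sample values are invented.

```python
def normalize_headers(raw_headers: list[dict]) -> dict:
    # Keys become lowercase with "-" replaced by "_"; a header that occurs
    # more than once is collected into a list in order of appearance.
    result: dict = {}
    for h in raw_headers:
        key = h["name"].lower().replace("-", "_")
        if key in result:
            if not isinstance(result[key], list):
                result[key] = [result[key]]
            result[key].append(h["value"])
        else:
            result[key] = h["value"]
    return result


sample = [
    {"name": "Received", "value": "from a.example.org"},
    {"name": "Received", "value": "from b.example.org"},
    {"name": "In-Reply-To", "value": "<1@example.org>"},
]
headers = normalize_headers(sample)
```

A consumer of the headers field therefore has to accept either a string or a list for any key, which is why the script checks isinstance(..., list) before using in_reply_to or references.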
Indexes:
    received_at, sent_at, sender.email, graph_id (unique),
    conversation_id, folder_path, has_attachments, categories,
    importance, flag_status, is_read,
    text_search (subject + body_preview + to + cc)
Version history:
    1.0 2026-06-02 Initial version: Graph API as the source
"""
import os
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
# Assumption: the client secret is supplied through the GRAPH_CLIENT_SECRET
# environment variable instead of being committed together with the script.
GRAPH_CLIENT_SECRET = os.environ["GRAPH_CLIENT_SECRET"]
GRAPH_MAILBOX = "ordinace@buzalkova.cz"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "ordinace@buzalkova.cz"
BATCH_SIZE = 100
PAGE_SIZE = 50
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.0"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
IMPORTANCE_MAP = {"low": 0, "normal": 1, "high": 2}
FLAG_STATUS_MAP = {"notFlagged": 0, "flagged": 1, "complete": 2}
RE_SUBJECT = re.compile(r"^(RE|FW|AW|SV|VS|TR|WG|odpov[eě]d[ťt]|fwd?)[:\s]+", re.IGNORECASE)
MSG_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification,internetMessageHeaders"
)
# ─── Graph API helpers ────────────────────────────────────────────────────────
_graph_token: Optional[str] = None
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
def graph_get(url: str, params: dict = None) -> dict:
global _graph_token
if not _graph_token:
get_token()
for attempt in range(2):
r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
if r.status_code == 401:
get_token()
continue
r.raise_for_status()
return r.json()
raise RuntimeError(f"Graph GET failed after retry: {url}")
def get_all_folders(parent_id: Optional[str] = None, parent_path: str = "") -> list[dict]:
    """Recursively load all folders of the mailbox. Returns [{id, path}]."""
if parent_id is None:
url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders"
else:
url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{parent_id}/childFolders"
folders = []
params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
while url:
data = graph_get(url, params)
for f in data.get("value", []):
path = f"{parent_path}/{f['displayName']}".lstrip("/")
folders.append({"id": f["id"], "path": path})
if f.get("childFolderCount", 0) > 0:
folders.extend(get_all_folders(f["id"], path))
url = data.get("@odata.nextLink")
params = None
return folders
def iter_folder_messages(folder_id: str):
    """Generator: yields the messages of a folder page by page."""
url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{folder_id}/messages"
params = {"$top": PAGE_SIZE, "$select": MSG_SELECT, "$expand": "attachments"}
while url:
data = graph_get(url, params)
for msg in data.get("value", []):
yield msg
url = data.get("@odata.nextLink")
params = None
# ─── Helper functions ─────────────────────────────────────────────────────────
def parse_date(raw) -> Optional[datetime]:
if raw is None:
return None
if isinstance(raw, datetime):
if raw.tzinfo:
return raw.astimezone(timezone.utc).replace(tzinfo=None)
return raw
try:
dt = dtparser.parse(str(raw))
if dt.tzinfo:
return dt.astimezone(timezone.utc).replace(tzinfo=None)
return dt
except Exception:
return None
def normalize_subject(subject: str) -> str:
s = subject.strip()
while True:
m = RE_SUBJECT.match(s)
if not m:
break
s = s[m.end():].strip()
return s
def parse_headers(raw_headers: list) -> dict:
result = {}
for h in raw_headers:
k = h["name"].lower().replace("-", "_")
v = h["value"]
if k in result:
existing = result[k]
if isinstance(existing, list):
existing.append(v)
else:
result[k] = [existing, v]
else:
result[k] = v
return result
def format_recipients(lst: list) -> str:
return "; ".join(
f'{r["emailAddress"].get("name", "")} <{r["emailAddress"].get("address", "")}>'.strip()
for r in lst
)
# ─── Main extraction ──────────────────────────────────────────────────────────
def extract_message(msg: dict, folder_path: str) -> Optional[dict]:
try:
# _id
mid = (msg.get("internetMessageId") or "").strip()
if not mid:
mid = f"graphid:{msg['id']}"
subject = msg.get("subject") or ""
norm_subject = normalize_subject(subject)
        # body
body_html = None
body_preview = msg.get("bodyPreview") or ""
body = msg.get("body", {})
if body.get("contentType") == "html":
content = body.get("content") or ""
body_html = content if len(content) <= 2 * 1024 * 1024 else content[:2 * 1024 * 1024]
elif body.get("contentType") == "text":
body_preview = (body.get("content") or "")[:2000]
        # sender
sender_ea = (msg.get("from") or msg.get("sender") or {}).get("emailAddress", {})
sender_email = sender_ea.get("address", "")
sender_name = sender_ea.get("name", "")
        # recipients
to_list = msg.get("toRecipients", [])
cc_list = msg.get("ccRecipients", [])
bcc_list = msg.get("bccRecipients", [])
recipients = (
[{"type": "to", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in to_list] +
[{"type": "cc", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in cc_list] +
[{"type": "bcc", "email": r["emailAddress"].get("address",""), "name": r["emailAddress"].get("name","")} for r in bcc_list]
)
# příznaky
importance = IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1)
flag_status = FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0)
# internet headers
raw_headers = msg.get("internetMessageHeaders") or []
headers = parse_headers(raw_headers)
in_reply_to = headers.get("in_reply_to", "")
if isinstance(in_reply_to, list):
in_reply_to = in_reply_to[0]
refs_raw = headers.get("references", "")
if isinstance(refs_raw, list):
refs_raw = " ".join(refs_raw)
internet_refs = [r.strip() for r in refs_raw.split() if r.strip()] if refs_raw else []
conv_topic = headers.get("thread_topic", "")
if isinstance(conv_topic, list):
conv_topic = conv_topic[0]
# conversation index
conv_index = ""
ci_raw = msg.get("conversationIndex")
if ci_raw:
try:
conv_index = base64.b64encode(base64.b64decode(ci_raw)).decode()
except Exception:
conv_index = ci_raw
# přílohy (jen metadata, bez obsahu)
attachments = []
for att in msg.get("attachments") or []:
fname = att.get("name") or ""
if not fname:
continue
attachments.append({
"filename": fname,
"size_bytes": att.get("size", 0),
"mime_type": att.get("contentType", "application/octet-stream"),
"content_id": att.get("contentId"),
"is_inline": att.get("isInline", False),
})
return {
"_id": mid,
"graph_id": msg["id"],
"subject": subject,
"normalized_subject": norm_subject,
"importance": importance,
"flag_status": flag_status,
"is_read": msg.get("isRead", False),
"is_draft": msg.get("isDraft", False),
"has_attachments": msg.get("hasAttachments", False),
"attachment_count": len(attachments),
"inference_classification": msg.get("inferenceClassification", ""),
"categories": msg.get("categories") or [],
"conversation_id": msg.get("conversationId", ""),
"conversation_index": conv_index,
"conversation_topic": conv_topic,
"in_reply_to": in_reply_to,
"internet_references": internet_refs,
"received_at": parse_date(msg.get("receivedDateTime")),
"sent_at": parse_date(msg.get("sentDateTime")),
"created_at": parse_date(msg.get("createdDateTime")),
"modified_at": parse_date(msg.get("lastModifiedDateTime")),
"folder_id": msg.get("parentFolderId", ""),
"folder_path": folder_path,
"sender": {
"email": sender_email,
"name": sender_name,
},
"to": format_recipients(to_list),
"cc": format_recipients(cc_list),
"bcc": format_recipients(bcc_list),
"recipients": recipients,
"body_html": body_html,
"body_preview": body_preview,
"attachments": attachments,
"headers": headers,
"parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
}
except Exception as e:
logging.error("extract_message failed [%s]: %s", msg.get("id", "?"), e)
return None


# ─── MongoDB indexes ──────────────────────────────────────────────────────────
def create_indexes(col):
    print(" Creating indexes...")
    col.create_index([("received_at", ASCENDING)])
    col.create_index([("sent_at", ASCENDING)])
    col.create_index([("sender.email", ASCENDING)])
    col.create_index([("graph_id", ASCENDING)], unique=True, sparse=True)
    col.create_index([("conversation_id", ASCENDING)])
    col.create_index([("folder_path", ASCENDING)])
    col.create_index([("has_attachments", ASCENDING)])
    col.create_index([("categories", ASCENDING)])
    col.create_index([("importance", ASCENDING)])
    col.create_index([("flag_status", ASCENDING)])
    col.create_index([("is_read", ASCENDING)])
    col.create_index([
        ("subject", TEXT),
        ("body_preview", TEXT),
        ("to", TEXT),
        ("cc", TEXT),
    ], name="text_search", default_language="none")
    print(" Indexes created.")


# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description=f"parse_emails_graph v{SCRIPT_VERSION}")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N messages (0 = all)")
    ap.add_argument("--skip-existing", action="store_true",
                    help="Skip messages that are already in MongoDB")
    ap.add_argument("--folder", default="",
                    help="Process only folders whose path contains this name (e.g. Inbox)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes at the end")
    args = ap.parse_args()

    start = datetime.now()
    print(f"=== parse_emails_graph v{SCRIPT_VERSION} ===")
    print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Mailbox: {GRAPH_MAILBOX}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")

    # Graph token
    print("\nConnecting to Graph API...")
    try:
        get_token()
        print(" Graph API OK")
    except Exception as e:
        print(f" ERROR: {e}")
        sys.exit(1)

    # MongoDB
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print(" MongoDB OK")
    except Exception as e:
        print(f" ERROR: MongoDB is not available -- {e}")
        sys.exit(1)
    col = client[MONGO_DB][MONGO_COL]

    # Skip existing
    existing: set = set()
    if args.skip_existing:
        print(" Loading existing records from MongoDB...")
        existing = set(col.distinct("_id"))
        print(f" {len(existing)} already imported")

    # Folders
    print("\nLoading folder list...")
    all_folders = get_all_folders()
    if args.folder:
        all_folders = [f for f in all_folders if args.folder.lower() in f["path"].lower()]
    print(f" Folders to process: {len(all_folders)}")
    for f in all_folders:
        print(f" {f['path']}")

    # Import
    batch = []
    ok_count = 0
    err_count = 0
    skip_count = 0
    total_i = 0

    def flush():
        """Write the pending batch to MongoDB and clear it."""
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            logging.error("bulk_write: %s", e)
            print(f" ERROR bulk_write: {e}")
        batch.clear()

    print()
    for folder in all_folders:
        print(f"--- Folder: {folder['path']} ---")
        folder_count = 0
        for msg in iter_folder_messages(folder["id"]):
            if args.limit and total_i >= args.limit:
                break
            mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
            if mid in existing:
                skip_count += 1
                total_i += 1
                continue
            doc = extract_message(msg, folder["path"])
            total_i += 1
            folder_count += 1
            if doc is None:
                err_count += 1
            else:
                batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
                ok_count += 1
                if len(batch) >= BATCH_SIZE:
                    flush()
            status = "ERR " if doc is None else "OK "
            subject_str = (doc.get("subject") or "")[:60] if doc else "?"
            sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
            print(f" {total_i:>6} {status} {subject_str:<60} {sender_str}")
            if total_i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = total_i / elapsed if elapsed > 0 else 0
                print(f" {'─'*80}")
                print(f" Progress: ok={ok_count} skip={skip_count} err={err_count} {rate:.1f} msg/s")
                print(f" {'─'*80}")
        flush()
        print(f"{folder_count} messages from folder {folder['path']}")
        if args.limit and total_i >= args.limit:
            break

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Result: ok={ok_count} | skip={skip_count} | err={err_count}")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Documents in collection: {col.count_documents({})}")
    if not args.no_indexes:
        print()
        create_indexes(col)
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
@@ -0,0 +1,611 @@
"""
parse_emails_graph_v1.3.py

Name:     parse_emails_graph_v1.3.py
Version:  1.3
Date:     2026-06-02
Author:   vladimir.buzalka

Description:
    Reads all emails from an arbitrary mailbox directly through the Microsoft
    Graph API and imports them as documents into MongoDB.

    Extracts every available property from each message:
      - subject, sender, recipients (To/CC/BCC with types)
      - received, sent, created and modified times (UTC)
      - HTML body (max 2 MB) + text preview
      - attachments (metadata: name, size, MIME type, inline flag, graph_att_id)
      - internet headers (SPF, DKIM, Received, X-*, ...)
      - MAPI equivalents: importance, flag, conversation thread,
        categories, In-Reply-To, References, ...
      - additionally: isRead, isDraft, folder_path, inferenceClassification

    Walks ALL folders of the mailbox recursively (Inbox, Sent, Deleted,
    archive folders, ...).

    DB:         emaily
    Collection: <mailbox> (e.g. ordinace@buzalkova.cz)
    _id:        Internet Message-ID (or "graphid:<id>" as a fallback)

    NOTE: The script only READS from the mailbox; it never writes to it.

Usage:
    # First import (everything):
    python parse_emails_graph_v1.3.py --mailbox ordinace@buzalkova.cz

    # Test on the first 50 messages:
    python parse_emails_graph_v1.3.py --mailbox ordinace@buzalkova.cz --limit 50 --no-indexes

    # A single folder only:
    python parse_emails_graph_v1.3.py --mailbox ordinace@buzalkova.cz --folder Inbox

    # Resume after an interruption (new messages only):
    python parse_emails_graph_v1.3.py --mailbox ordinace@buzalkova.cz --mode new-only

    # Regular sync (updates is_read, flag, folder; imports new messages):
    python parse_emails_graph_v1.3.py --mailbox ordinace@buzalkova.cz --mode sync

    # A different mailbox:
    python parse_emails_graph_v1.3.py --mailbox vladimir.buzalka@buzalka.cz

Modes (--mode):
    full      Full upsert of all fields for every message (default)
    new-only  Skips messages that are already in MongoDB, imports only new ones
    sync      Existing messages: updates only the mutable fields (is_read,
              is_draft, flag_status, importance, categories, modified_at,
              folder_id, folder_path, parsed_at). New messages are imported
              in full. Ideal for scheduled runs.

Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+

Document structure in MongoDB:
    _id                       Internet Message-ID (or graphid: fallback)
    graph_id                  Graph API message ID
    subject                   message subject
    normalized_subject        subject without RE:/FW:/AW: prefixes
    importance                0=low 1=normal 2=high
    flag_status               0=not flagged 1=flagged 2=completed
    is_read                   bool, current read state in the mailbox
    is_draft                  bool
    has_attachments           bool
    attachment_count          int
    inference_classification  focused / other
    categories                [str]
    conversation_id           Graph conversationId
    conversation_index        base64 conversationIndex
    conversation_topic        thread topic (from the Thread-Topic internet header)
    in_reply_to               Message-ID of the previous message
    internet_references       [Message-ID]
    received_at               datetime UTC
    sent_at                   datetime UTC
    created_at                datetime UTC
    modified_at               datetime UTC
    folder_id                 Graph parentFolderId
    folder_path               full folder path (e.g. Inbox/Subfolder)
    sender.email              sender's email address
    sender.name               display name
    to                        To string (joined)
    cc                        CC string
    bcc                       BCC string
    recipients                [{type, email, name}]
    body_html                 HTML body (max 2 MB)
    body_preview              text preview (Graph bodyPreview, max 255 characters;
                              for plain-text bodies the first 2000 characters of the body)
    attachments               [{filename, size_bytes, mime_type, is_inline, graph_att_id}]
    headers                   dict of internet headers
    parsed_at                 datetime UTC

Indexes:
    received_at, sent_at, sender.email, graph_id (unique),
    conversation_id, folder_path, has_attachments, categories,
    importance, flag_status, is_read,
    text_search (subject + body_preview + to + cc)

Version history:
    1.0  2026-06-02  Initial version
    1.1  2026-06-02  Added modes --mode full/new-only/sync;
                     removed --skip-existing (replaced by --mode new-only)
    1.2  2026-06-02  $expand attachments with $select (without contentBytes, which is faster);
                     attachments store graph_att_id for direct download without name matching
    1.3  2026-06-02  --mailbox as a required argument, so the script works for
                     any mailbox; the MongoDB collection is named after the mailbox
"""
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import msal
import requests
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
BATCH_SIZE = 100
PAGE_SIZE = 50
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.3"
# The mailbox is set at runtime from the --mailbox argument
GRAPH_MAILBOX: str = ""
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
    filename=str(LOG_FILE),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)
IMPORTANCE_MAP = {"low": 0, "normal": 1, "high": 2}
FLAG_STATUS_MAP = {"notFlagged": 0, "flagged": 1, "complete": 2}
RE_SUBJECT = re.compile(r"^(RE|FW|AW|SV|VS|TR|WG|odpov[eě]d[ťt]|fwd?)[:\s]+", re.IGNORECASE)

# $expand attachments without contentBytes: only the metadata we need
ATT_EXPAND = "attachments($select=id,name,contentType,size,isInline)"
MSG_SELECT = (
    "id,internetMessageId,subject,bodyPreview,body,"
    "importance,isRead,isDraft,hasAttachments,"
    "receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
    "sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
    "conversationId,conversationIndex,parentFolderId,"
    "categories,flag,inferenceClassification,internetMessageHeaders"
)
MSG_SELECT_SYNC = (
    "id,internetMessageId,isRead,isDraft,flag,categories,"
    "lastModifiedDateTime,parentFolderId,importance"
)


# ─── Graph API helpers ────────────────────────────────────────────────────────
_graph_token: Optional[str] = None


def get_token() -> str:
    """Acquire an app-only Graph token (client credentials flow) and cache it."""
    global _graph_token
    app = msal.ConfidentialClientApplication(
        GRAPH_CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
        client_credential=GRAPH_CLIENT_SECRET,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(f"Graph auth failed: {result}")
    _graph_token = result["access_token"]
    return _graph_token


def graph_get(url: str, params: Optional[dict] = None) -> dict:
    """GET a Graph URL; on HTTP 401 refresh the token once and retry."""
    global _graph_token
    if not _graph_token:
        get_token()
    for attempt in range(2):
        r = requests.get(url, headers={"Authorization": f"Bearer {_graph_token}"}, params=params, timeout=30)
        if r.status_code == 401:
            get_token()
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retry: {url}")


def get_all_folders(parent_id: Optional[str] = None, parent_path: str = "") -> list[dict]:
    """Recursively load all folders of the mailbox. Returns [{id, path}]."""
    if parent_id is None:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders"
    else:
        url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{parent_id}/childFolders"
    folders = []
    params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
    while url:
        data = graph_get(url, params)
        for f in data.get("value", []):
            path = f"{parent_path}/{f['displayName']}".lstrip("/")
            folders.append({"id": f["id"], "path": path})
            if f.get("childFolderCount", 0) > 0:
                folders.extend(get_all_folders(f["id"], path))
        # The nextLink already carries the query string, so params are dropped.
        url = data.get("@odata.nextLink")
        params = None
    return folders


def iter_folder_messages(folder_id: str, select: str = MSG_SELECT, expand_attachments: bool = True):
    """Generator: yields the messages of a folder, page by page."""
    url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/mailFolders/{folder_id}/messages"
    params = {"$top": PAGE_SIZE, "$select": select}
    if expand_attachments:
        params["$expand"] = ATT_EXPAND
    while url:
        data = graph_get(url, params)
        for msg in data.get("value", []):
            yield msg
        url = data.get("@odata.nextLink")
        params = None


# ─── Helper functions ─────────────────────────────────────────────────────────
def parse_date(raw) -> Optional[datetime]:
    """Parse a date value into a naive UTC datetime; return None on failure."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


def normalize_subject(subject: str) -> str:
    """Strip leading reply/forward prefixes (RE:, FW:, AW:, ...) from a subject."""
    s = subject.strip()
    while True:
        m = RE_SUBJECT.match(s)
        if not m:
            break
        s = s[m.end():].strip()
    return s


def parse_headers(raw_headers: list) -> dict:
    """Convert Graph header pairs into a dict; repeated headers become lists."""
    result = {}
    for h in raw_headers:
        k = h["name"].lower().replace("-", "_")
        v = h["value"]
        if k in result:
            existing = result[k]
            result[k] = existing + [v] if isinstance(existing, list) else [existing, v]
        else:
            result[k] = v
    return result


def format_recipients(lst: list) -> str:
    """Join recipients into a single 'Name <address>; ...' string."""
    return "; ".join(
        f'{r["emailAddress"].get("name", "")} <{r["emailAddress"].get("address", "")}>'.strip()
        for r in lst
    )


# ─── Message extraction ───────────────────────────────────────────────────────
def extract_message(msg: dict, folder_path: str) -> Optional[dict]:
    """Full extraction: used for mode full and for new messages in sync/new-only."""
    try:
        mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
        subject = msg.get("subject") or ""

        # body ("or {}" also covers an explicit null in the payload)
        body_html = None
        body_preview = msg.get("bodyPreview") or ""
        body = msg.get("body") or {}
        if body.get("contentType") == "html":
            content = body.get("content") or ""
            body_html = content if len(content) <= 2 * 1024 * 1024 else content[:2 * 1024 * 1024]
        elif body.get("contentType") == "text":
            body_preview = (body.get("content") or "")[:2000]

        # sender and recipients
        sender_ea = (msg.get("from") or msg.get("sender") or {}).get("emailAddress") or {}
        to_list = msg.get("toRecipients") or []
        cc_list = msg.get("ccRecipients") or []
        bcc_list = msg.get("bccRecipients") or []
        recipients = (
            [{"type": "to", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in to_list] +
            [{"type": "cc", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in cc_list] +
            [{"type": "bcc", "email": r["emailAddress"].get("address", ""), "name": r["emailAddress"].get("name", "")} for r in bcc_list]
        )

        # flags
        importance = IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1)
        flag_status = FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0)

        # internet headers
        raw_headers = msg.get("internetMessageHeaders") or []
        headers = parse_headers(raw_headers)
        in_reply_to = headers.get("in_reply_to", "")
        if isinstance(in_reply_to, list):
            in_reply_to = in_reply_to[0]
        refs_raw = headers.get("references", "")
        if isinstance(refs_raw, list):
            refs_raw = " ".join(refs_raw)
        internet_refs = [r.strip() for r in refs_raw.split() if r.strip()] if refs_raw else []
        conv_topic = headers.get("thread_topic", "")
        if isinstance(conv_topic, list):
            conv_topic = conv_topic[0]

        # conversation index (decode/encode round trip normalizes the base64 text)
        conv_index = ""
        ci_raw = msg.get("conversationIndex")
        if ci_raw:
            try:
                conv_index = base64.b64encode(base64.b64decode(ci_raw)).decode()
            except Exception:
                conv_index = ci_raw

        # attachments (metadata only, no content)
        attachments = []
        for att in msg.get("attachments") or []:
            fname = att.get("name") or ""
            if not fname:
                continue
            attachments.append({
                "filename": fname,
                "size_bytes": att.get("size", 0),
                "mime_type": att.get("contentType", "application/octet-stream"),
                "is_inline": att.get("isInline", False),
                "graph_att_id": att.get("id"),
            })

        return {
            "_id": mid,
            "graph_id": msg["id"],
            "subject": subject,
            "normalized_subject": normalize_subject(subject),
            "importance": importance,
            "flag_status": flag_status,
            "is_read": msg.get("isRead", False),
            "is_draft": msg.get("isDraft", False),
            "has_attachments": msg.get("hasAttachments", False),
            "attachment_count": len(attachments),
            "inference_classification": msg.get("inferenceClassification", ""),
            "categories": msg.get("categories") or [],
            "conversation_id": msg.get("conversationId", ""),
            "conversation_index": conv_index,
            "conversation_topic": conv_topic,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "received_at": parse_date(msg.get("receivedDateTime")),
            "sent_at": parse_date(msg.get("sentDateTime")),
            "created_at": parse_date(msg.get("createdDateTime")),
            "modified_at": parse_date(msg.get("lastModifiedDateTime")),
            "folder_id": msg.get("parentFolderId", ""),
            "folder_path": folder_path,
            "sender": {
                "email": sender_ea.get("address", ""),
                "name": sender_ea.get("name", ""),
            },
            "to": format_recipients(to_list),
            "cc": format_recipients(cc_list),
            "bcc": format_recipients(bcc_list),
            "recipients": recipients,
            "body_html": body_html,
            "body_preview": body_preview,
            "attachments": attachments,
            "headers": headers,
            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }
    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg.get("id", "?"), e)
        return None


def extract_sync_fields(msg: dict, folder_path: str) -> dict:
    """Mutable fields only: used in sync mode for messages that already exist."""
    return {
        "is_read": msg.get("isRead", False),
        "is_draft": msg.get("isDraft", False),
        "flag_status": FLAG_STATUS_MAP.get((msg.get("flag") or {}).get("flagStatus", "notFlagged"), 0),
        "importance": IMPORTANCE_MAP.get(msg.get("importance", "normal"), 1),
        "categories": msg.get("categories") or [],
        "modified_at": parse_date(msg.get("lastModifiedDateTime")),
        "folder_id": msg.get("parentFolderId", ""),
        "folder_path": folder_path,
        "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
    }


# ─── MongoDB indexes ──────────────────────────────────────────────────────────
def create_indexes(col):
    print(" Creating indexes...")
    col.create_index([("received_at", ASCENDING)])
    col.create_index([("sent_at", ASCENDING)])
    col.create_index([("sender.email", ASCENDING)])
    col.create_index([("graph_id", ASCENDING)], unique=True, sparse=True)
    col.create_index([("conversation_id", ASCENDING)])
    col.create_index([("folder_path", ASCENDING)])
    col.create_index([("has_attachments", ASCENDING)])
    col.create_index([("categories", ASCENDING)])
    col.create_index([("importance", ASCENDING)])
    col.create_index([("flag_status", ASCENDING)])
    col.create_index([("is_read", ASCENDING)])
    col.create_index([
        ("subject", TEXT),
        ("body_preview", TEXT),
        ("to", TEXT),
        ("cc", TEXT),
    ], name="text_search", default_language="none")
    print(" Indexes created.")


# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    global GRAPH_MAILBOX
    ap = argparse.ArgumentParser(description=f"parse_emails_graph v{SCRIPT_VERSION}")
    ap.add_argument("--mailbox", required=True,
                    help="Mailbox address (e.g. ordinace@buzalkova.cz)")
    ap.add_argument("--mode", default="full", choices=["full", "new-only", "sync"],
                    help="full=full upsert (default) | new-only=new messages only | "
                         "sync=update only mutable fields of existing messages, import new ones in full")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N messages (0 = all)")
    ap.add_argument("--folder", default="",
                    help="Process only folders whose path contains this name (e.g. Inbox)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes at the end")
    args = ap.parse_args()

    GRAPH_MAILBOX = args.mailbox
    mongo_col = args.mailbox  # the collection is named after the mailbox
    start = datetime.now()
    print(f"=== parse_emails_graph v{SCRIPT_VERSION} ===")
    print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Mailbox: {GRAPH_MAILBOX}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{mongo_col}")
    print(f"Mode: {args.mode}")

    print("\nConnecting to Graph API...")
    try:
        get_token()
        print(" Graph API OK")
    except Exception as e:
        print(f" ERROR: {e}")
        sys.exit(1)

    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print(" MongoDB OK")
    except Exception as e:
        print(f" ERROR: MongoDB is not available -- {e}")
        sys.exit(1)
    col = client[MONGO_DB][mongo_col]

    existing: set = set()
    if args.mode in ("new-only", "sync"):
        print(" Loading existing records from MongoDB...")
        existing = set(col.distinct("_id"))
        print(f" {len(existing)} already imported")

    print("\nLoading folder list...")
    all_folders = get_all_folders()
    if args.folder:
        all_folders = [f for f in all_folders if args.folder.lower() in f["path"].lower()]
    print(f" Folders to process: {len(all_folders)}")
    for f in all_folders:
        print(f" {f['path']}")

    # In sync mode only the lightweight field set is listed; new messages are
    # fetched in full one by one further below.
    is_sync = args.mode == "sync"
    msg_select = MSG_SELECT_SYNC if is_sync else MSG_SELECT
    expand_att = not is_sync

    batch = []
    ok_count = 0
    sync_count = 0
    err_count = 0
    skip_count = 0
    total_i = 0

    def flush():
        """Write the pending batch to MongoDB and clear it."""
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            logging.error("bulk_write: %s", e)
            print(f" ERROR bulk_write: {e}")
        batch.clear()

    print()
    for folder in all_folders:
        print(f"--- Folder: {folder['path']} ---")
        folder_count = 0
        for msg in iter_folder_messages(folder["id"], select=msg_select, expand_attachments=expand_att):
            if args.limit and total_i >= args.limit:
                break
            mid = (msg.get("internetMessageId") or "").strip() or f"graphid:{msg['id']}"
            total_i += 1
            folder_count += 1
            if args.mode == "new-only" and mid in existing:
                skip_count += 1
                continue
            if is_sync and mid in existing:
                fields = extract_sync_fields(msg, folder["path"])
                batch.append(UpdateOne({"_id": mid}, {"$set": fields}))
                sync_count += 1
                print(f" {total_i:>6} SYN {mid[:80]}")
            else:
                if is_sync:
                    # New message seen during sync: fetch the full field set.
                    full_url = f"{GRAPH_URL}/users/{GRAPH_MAILBOX}/messages/{msg['id']}"
                    full_params = {"$select": MSG_SELECT, "$expand": ATT_EXPAND}
                    try:
                        msg = graph_get(full_url, full_params)
                    except Exception as e:
                        logging.error("full fetch failed [%s]: %s", msg.get("id", "?"), e)
                        err_count += 1
                        continue
                doc = extract_message(msg, folder["path"])
                if doc is None:
                    err_count += 1
                    print(f" {total_i:>6} ERR {mid[:80]}")
                else:
                    batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
                    ok_count += 1
                    subject_str = (doc.get("subject") or "")[:60]
                    sender_str = (doc.get("sender", {}).get("email") or "")[:40]
                    print(f" {total_i:>6} OK {subject_str:<60} {sender_str}")
            if len(batch) >= BATCH_SIZE:
                flush()
            if total_i % 500 == 0:
                elapsed = (datetime.now() - start).total_seconds()
                rate = total_i / elapsed if elapsed > 0 else 0
                print(f" {'─'*80}")
                print(f" Progress: ok={ok_count} sync={sync_count} skip={skip_count} err={err_count} {rate:.1f} msg/s")
                print(f" {'─'*80}")
        flush()
        print(f"{folder_count} messages from folder {folder['path']}")
        if args.limit and total_i >= args.limit:
            break

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Result: ok={ok_count} | sync={sync_count} | skip={skip_count} | err={err_count}")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Documents in collection: {col.count_documents({})}")
    if not args.no_indexes:
        print()
        create_indexes(col)
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
@@ -0,0 +1,248 @@
# parse_emails_tower_v1.1

## Running

**First run:**

```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.1.py > /scripts/parse_emails.log 2>&1"
```

**Resume after an interruption (skips files that are already imported):**

```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.1.py --skip-existing > /scripts/parse_emails.log 2>&1"
```

---

## Import status

**Follow the progress (live log):**

```bash
docker exec -it python-runner tail -f /scripts/parse_emails.log
```

**Number of emails in MongoDB:**

```bash
docker exec -it python-runner python -c \
"from pymongo import MongoClient; c=MongoClient('mongodb://192.168.1.76:27017'); print(c['emaily']['vbuzalka@its.jnj.com'].count_documents({}))"
```

---

**Name:** parse_emails_tower_v1.1.py

**Version:** 1.1

**Date:** 2026-06-02

**Author:** vladimir.buzalka

---

## Purpose

Imports all `.msg` files into MongoDB. **Every available property** is extracted from each file, much like EXIF data for photos.

- **DB:** `emaily`
- **Collection:** `vbuzalka@its.jnj.com`
- `_id` = Internet Message-ID (or `filename:<stem>` as a fallback)
- Safe to interrupt and rerun: documents are upserted by `_id`
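
The `_id` rule and the rerun safety described in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the script's actual code: `document_id` and `upsert` are hypothetical helpers, and a plain dict stands in for the MongoDB collection.

```python
from pathlib import Path


def document_id(message_id, msg_path):
    """Pick the _id: the Message-ID, else 'filename:<stem>' (hypothetical helper)."""
    message_id = (message_id or "").strip()
    if message_id:
        return message_id
    return f"filename:{Path(msg_path).stem}"


def upsert(store, doc):
    """Emulate an upsert keyed on _id: insert if missing, otherwise overwrite fields."""
    merged = dict(store.get(doc["_id"], {}))
    merged.update(doc)
    store[doc["_id"]] = merged


store = {}  # stand-in for the collection, keyed by _id
first = {"_id": document_id("<abc@example.com>", "a.msg"), "subject": "Hello"}
upsert(store, first)
upsert(store, first)  # importing the same file again does not create a duplicate
upsert(store, {"_id": document_id(None, "/mnt/JNJEMAILS/7A3F.msg"), "subject": "No Message-ID"})
print(sorted(store))
```
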
---

## Environment

Runs in the Docker container **python-runner** on **Unraid Tower**.

| Component | Location |
|---|---|
| Container | `python-runner` (Docker on Unraid Tower) |
| .msg files | `/mnt/user/JNJEMAILS` → `/mnt/JNJEMAILS` inside the container |
| Scripts | `/mnt/user/Scripts` → `/scripts` inside the container |
| MongoDB | `192.168.1.76:27017` (external, outside the container) |

---

## Running (from the Unraid terminal)

**Test on 50 emails:**

```bash
docker exec -it python-runner python /scripts/parse_emails_tower_v1.1.py --limit 50 --no-indexes
```

**Full import in the background (log to a file):**

```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.1.py > /scripts/parse_emails.log 2>&1"
```

**Resume after an interruption:**

```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.1.py --skip-existing > /scripts/parse_emails.log 2>&1"
```

**Follow the progress (Ctrl+C stops the tail, the import keeps running):**

```bash
docker exec -it python-runner tail -f /scripts/parse_emails.log
```

### All parameters

| Parameter | Description |
|---|---|
| `--skip-existing` | Loads the list of finished files from MongoDB and skips them. Use it to resume after an interruption. |
| `--limit N` | Processes only the first N files. Useful for testing. |
| `--no-indexes` | Does not create indexes at the end. Use it if you interrupt the run midway; create the indexes manually once everything is imported. |
| `--msgs-dir PATH` | Overrides the default path to the .msg files (default: `/mnt/JNJEMAILS`). |

---

## Console output

Each email is printed on one line:
```
1/69371 OK RE: Protocol deviation CZ10022 jan.novak@its.jnj.com
2/69371 OK UCO3001: Draft FUL pro DD5-CZ10022 monitor@4gclinical.com
3/69371 ERR ? ?
```
Every 500 emails a separator with a progress summary is printed:
```
────────────────────────────────────────────────────────────────────────────────
Průběh: ok=498 err=2 0.4 msg/s ETA 47h12m
────────────────────────────────────────────────────────────────────────────────
```
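
The rate and ETA in the progress line follow from simple arithmetic. The sketch below assumes the ETA is derived from the average rate so far; the script's actual formula is not shown here, and `eta_seconds` and `format_eta` are hypothetical names.

```python
def eta_seconds(total, done, elapsed_s):
    """Estimate the remaining seconds from the average rate so far."""
    if done <= 0 or elapsed_s <= 0:
        return None  # no rate available yet
    rate = done / elapsed_s  # messages per second
    return (total - done) / rate


def format_eta(seconds):
    """Format seconds as '<hours>h<minutes>m'."""
    if seconds is None:
        return "?"
    hours, rest = divmod(int(seconds), 3600)
    return f"{hours}h{rest // 60:02d}m"


# Example: 500 of 69371 files done after 1250 s, i.e. 0.4 msg/s on average
remaining = eta_seconds(total=69371, done=500, elapsed_s=1250)
print(f"{500 / 1250:.1f} msg/s  ETA {format_eta(remaining)}")
```
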
A summary is printed at the end:
```
====================================================
Vysledek: ok=69300 | skip=0 | err=71
Celkovy cas: 47h 23m 10s
Dokumentu v kolekci: 69300
```
---

## Data extracted from each .msg

| Field | Description |
|---|---|
| Subject, normalized subject | |
| Sender | email, name, SMTP address |
| Recipients To/CC/BCC | structured as `[{type, email, name}]` |
| Received and sent time | UTC |
| Body | plain text + HTML (max 2 MB) |
| Attachments | metadata: name, size, MIME type, inline flag |
| Internet headers | X-Originating-IP, Received, DKIM, X-Mailer, ... |
| MAPI | importance, sensitivity, flag, conversation thread, categories |
| In-Reply-To, References | for thread reconstruction |
| Raw MAPI properties | `{0xXXXX: value}` |
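
As an illustration of how In-Reply-To and References can be used to rebuild threads, here is a minimal sketch. It assumes the documents carry `in_reply_to`, `internet_references` and `received_at` fields (names taken from the Graph variant of the importer, so an assumption for this script), and `thread_root` and `group_threads` are hypothetical helpers. It relies on the References header usually listing ancestors oldest first.

```python
from collections import defaultdict
from datetime import datetime


def thread_root(doc):
    """Return the Message-ID that most likely started this document's thread."""
    refs = doc.get("internet_references") or []
    if refs:
        return refs[0]  # References usually lists ancestors, oldest first
    return doc.get("in_reply_to") or doc["_id"]


def group_threads(docs):
    """Group documents by thread root and sort each thread chronologically."""
    threads = defaultdict(list)
    for doc in docs:
        threads[thread_root(doc)].append(doc)
    for messages in threads.values():
        messages.sort(key=lambda d: d["received_at"])
    return dict(threads)


docs = [
    {"_id": "<3>", "in_reply_to": "<2>", "internet_references": ["<1>", "<2>"],
     "received_at": datetime(2026, 6, 2, 12)},
    {"_id": "<1>", "in_reply_to": "", "internet_references": [],
     "received_at": datetime(2026, 6, 2, 10)},
    {"_id": "<2>", "in_reply_to": "<1>", "internet_references": ["<1>"],
     "received_at": datetime(2026, 6, 2, 11)},
    {"_id": "<9>", "in_reply_to": "", "internet_references": [],
     "received_at": datetime(2026, 6, 2, 9)},
]
threads = group_threads(docs)
for root, messages in threads.items():
    print(root, [m["_id"] for m in messages])
```
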
---

## Value codes

| Field | Value | Meaning |
|---|---|---|
| `importance` | 0 | Low |
| | 1 | Normal |
| | 2 | High |
| `sensitivity` | 0 | Normal |
| | 1 | Personal |
| | 2 | Private |
| | 3 | Confidential |
| `flag_status` | 0 | Not flagged |
| | 1 | Flagged (follow up) |
| | 2 | Completed |
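
When reading documents back, the numeric codes from the table can be turned into labels with plain lookup tables. This is a small sketch; `describe_codes` is a hypothetical helper and the label strings are illustrative.

```python
IMPORTANCE = {0: "low", 1: "normal", 2: "high"}
SENSITIVITY = {0: "normal", 1: "personal", 2: "private", 3: "confidential"}
FLAG_STATUS = {0: "not flagged", 1: "flagged (follow up)", 2: "completed"}


def describe_codes(doc):
    """Translate the numeric codes of one document into readable labels."""
    return {
        "importance": IMPORTANCE.get(doc.get("importance"), "unknown"),
        "sensitivity": SENSITIVITY.get(doc.get("sensitivity"), "unknown"),
        "flag_status": FLAG_STATUS.get(doc.get("flag_status"), "unknown"),
    }


labels = describe_codes({"importance": 2, "sensitivity": 3, "flag_status": 1})
print(labels)
```
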
---

## MongoDB indexes

Created automatically at the end of the import (`--no-indexes` skips this step):

| Index | Fields |
|---|---|
| Chronological | `received_at`, `sent_at` |
| Sender | `sender.email` |
| File | `filename` (unique) |
| Conversation | `conversation_topic` |
| Filters | `has_attachments`, `categories`, `importance`, `flag_status` |
| Full-text | `subject` + `body_text` + `to` + `cc` (text index `text_search`) |
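
If the import was run with `--no-indexes`, the indexes from the table can be created by hand. The sketch below is one way to express them for pymongo's `Collection.create_index`; it assumes each listed field gets its own single-field index (the table does not say whether any are compound), and `INDEX_SPECS`, `ensure_indexes` and `RecordingCollection` are hypothetical names. A recording stand-in replaces the real collection so the example runs without a MongoDB server.

```python
ASCENDING = 1   # same value as pymongo.ASCENDING
TEXT = "text"   # same value as pymongo.TEXT

INDEX_SPECS = [
    ([("received_at", ASCENDING)], {}),
    ([("sent_at", ASCENDING)], {}),
    ([("sender.email", ASCENDING)], {}),
    ([("filename", ASCENDING)], {"unique": True}),
    ([("conversation_topic", ASCENDING)], {}),
    ([("has_attachments", ASCENDING)], {}),
    ([("categories", ASCENDING)], {}),
    ([("importance", ASCENDING)], {}),
    ([("flag_status", ASCENDING)], {}),
    ([("subject", TEXT), ("body_text", TEXT), ("to", TEXT), ("cc", TEXT)],
     {"name": "text_search"}),
]


def ensure_indexes(col):
    """Request every index in INDEX_SPECS; `col` is expected to be a pymongo Collection."""
    for keys, options in INDEX_SPECS:
        col.create_index(keys, **options)


class RecordingCollection:
    """Stand-in that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def create_index(self, keys, **options):
        self.calls.append((keys, options))


col = RecordingCollection()
ensure_indexes(col)
print(len(col.calls), "indexes requested")
```
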
---

## Sample queries (MongoDB shell / MCP)

**Emails about UCO3001 with an attachment:**

```javascript
db["vbuzalka@its.jnj.com"].find({
$text: { $search: "UCO3001" },
has_attachments: true
}).sort({ received_at: -1 })
```

**Emails from a specific sender:**

```javascript
db["vbuzalka@its.jnj.com"].find({
"sender.email": /covance/i
}).sort({ received_at: -1 })
```

**A whole conversation thread:**

```javascript
db["vbuzalka@its.jnj.com"].find({
conversation_topic: "Protocol deviation CZ10022"
}).sort({ received_at: 1 })
```

**Statistics by sender (top 20):**

```javascript
db["vbuzalka@its.jnj.com"].aggregate([
{ $group: { _id: "$sender.email", count: { $sum: 1 } } },
{ $sort: { count: -1 } },
{ $limit: 20 }
])
```
---
## Chybový log
Soubory které selhaly jsou zalogrovány do `parse_emails_errors.log` vedle skriptu (tj. `/scripts/parse_emails_errors.log``\\tower\Scripts\parse_emails_errors.log`):
```
2026-06-02 20:14:33 | open failed [7A3F...0000.msg]: <důvod>
```
---
## Performance
| Parameter | Value |
|---|---|
| Number of files | ~69 000 |
| Throughput | ~0.4 msg/s (htmlBody decoding) |
| Estimated time | 48 hours |
| Batch size | 200 documents / bulk_write |
| Estimated DB size | 25 GB |
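
The time estimate follows directly from the file count and the throughput. A short check, using only the figures from the table (the derived average document size is a rough estimate that ignores index overhead):

```python
import math

FILES = 69_000          # approximate number of .msg files (from the table)
RATE_MSG_PER_S = 0.4    # throughput in messages per second (from the table)
BATCH_SIZE = 200        # documents per bulk_write (from the table)
DB_SIZE_GB = 25         # estimated database size (from the table)

# Total run time in hours: files / rate gives seconds.
hours = FILES / RATE_MSG_PER_S / 3600

# Number of bulk_write calls needed for the whole import.
batches = math.ceil(FILES / BATCH_SIZE)

# Average size per document implied by the estimated DB size.
avg_doc_kb = DB_SIZE_GB * 1024 * 1024 / FILES

print(f"{hours:.1f} h, {batches} bulk_write calls, ~{avg_doc_kb:.0f} KB per document")
```

At 0.4 msg/s the import takes about 47.9 hours, which matches the 48-hour estimate, and it issues 345 bulk writes.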
---
## Dependencies (in the python-runner Docker image)
```
extract-msg==0.55.0
pymongo
python-dateutil
```
The image is built from the `Dockerfile` in `/mnt/user/Scripts/python-runner/`.
---
## Version history
| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-06-01 | Initial version |
| 1.1 | 2026-06-02 | Deployed on Unraid Tower in the python-runner Docker container; MSGS_DIR changed from the SMB share (`\\tower\JNJEMAILS`) to the local mount (`/mnt/JNJEMAILS`); run instructions updated for `docker exec` |
+660
@@ -0,0 +1,660 @@
"""
parse_emails_tower_v1.1.py

Name:    parse_emails_tower_v1.1.py
Version: 1.1
Date:    2026-06-02
Author:  vladimir.buzalka

Description:
    Parses all .msg files in MSGS_DIR and imports them as documents
    into MongoDB. ALL available properties are extracted from each file,
    much like EXIF data from photos:
      - subject, sender, recipients (To/CC/BCC with types)
      - delivery and submission time (UTC)
      - plain-text and HTML body (max 2 MB)
      - attachments (metadata: name, size, MIME type, inline flag)
      - internet headers (X-Originating-IP, Received, DKIM, ...)
      - MAPI properties: importance, sensitivity, flag, conversation thread,
        categories, In-Reply-To, References, ...
      - all raw MAPI properties as {0xXXXX: value}

    DB:         emaily
    Collection: vbuzalka@its.jnj.com
    _id:        Internet Message-ID (or "filename:<stem>" as a fallback)

Safe to interrupt and re-run:
    - upsert by _id: duplicates are overwritten automatically
    - --skip-existing loads the list of finished files from MongoDB and
      skips them, so the import resumes after an interruption without
      losing work

Environment:
    Runs in the Docker container "python-runner" on Unraid Tower.
    The .msg files are available as a local disk (volume mount):
        /mnt/user/JNJEMAILS -> /mnt/JNJEMAILS (inside the container)
    MongoDB at 192.168.1.76:27017 (external, runs outside the container).

Running (from the Unraid terminal):
    # Test on 50 emails:
    docker exec -it python-runner python /scripts/parse_emails_tower_v1.1.py --limit 50 --no-indexes
    # Full import in the background (log to a file):
    docker exec -d python-runner bash -c \
        "python /scripts/parse_emails_tower_v1.1.py > /scripts/parse_emails.log 2>&1"
    # Resume after an interruption:
    docker exec -d python-runner bash -c \
        "python /scripts/parse_emails_tower_v1.1.py --skip-existing > /scripts/parse_emails.log 2>&1"
    # Watch the progress:
    docker exec -it python-runner tail -f /scripts/parse_emails.log

Console output:
    One line per email:
        <index>/<total> OK/ERR <subject, 60 chars> <sender>
    Every 500 emails: a separator with progress, rate and ETA.
    At the end: summary of ok/skip/err, total time, number of documents
    in the collection.

Dependencies (installed in the python-runner Docker image):
    extract-msg==0.55.0, pymongo, python-dateutil
    Python 3.12, Linux (Docker container on Unraid Tower)

MongoDB document structure:
    _id                         Internet Message-ID (or filename: fallback)
    filename                    name of the .msg file (20-character hex + .msg)
    subject                     message subject
    normalized_subject          subject without RE:/FW: prefixes
    importance                  0=low 1=normal 2=high
    sensitivity                 0=normal 1=personal 2=private 3=confidential
    flag_status                 0=no flag 1=flagged 2=completed
    read_receipt_requested      bool
    delivery_receipt_requested  bool
    has_attachments             bool
    attachment_count            int
    message_size_bytes          size of the .msg file on disk
    conversation_topic          thread topic (PR_CONVERSATION_TOPIC)
    conversation_index          base64 PR_CONVERSATION_INDEX
    in_reply_to                 Message-ID of the previous message
    internet_references         [Message-ID], the full thread history
    categories                  [str], MAPI categories / labels
    received_at                 datetime UTC, delivery time
    sent_at                     datetime UTC, submission time
    sender.email                sender's email address
    sender.name                 sender's display name
    sender.smtp                 SMTP address (for internal EX addresses)
    to                          To string (as shown in Outlook)
    cc                          CC string
    bcc                         BCC string
    display_to                  PR_DISPLAY_TO (shortened list)
    display_cc                  PR_DISPLAY_CC
    recipients                  [{type, email, name}], to/cc/bcc with types
    body_text                   plain-text body
    body_html                   HTML body (max 2 MB, None if absent)
    attachments                 [{filename, size_bytes, mime_type,
                                  content_id, is_inline}]
    headers                     dict of internet headers (keys in lowercase_with_underscores)
    mapi                        dict of all raw MAPI properties {0xXXXX: value}
    parsed_at                   datetime UTC, time of parsing

Indexes (created automatically at the end):
    received_at, sent_at, sender.email, filename (unique),
    conversation_topic, has_attachments, categories, importance,
    flag_status, text_search (subject + body_text + to + cc)

Errors:
    Files that failed are logged to parse_emails_errors.log
    in the script's directory. Line: timestamp | open/extract failed | reason.

Version history:
    1.0 2026-06-01 Initial version
    1.1 2026-06-02 Deployed on Unraid Tower in the python-runner Docker container;
                   MSGS_DIR changed from the SMB share to the local mount /mnt/JNJEMAILS;
                   run instructions updated for docker exec
"""
import sys
import re
import logging
import argparse
import base64
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import extract_msg
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# ─── CONFIGURATION ────────────────────────────────────────────────────────────
MSGS_DIR = Path("/mnt/JNJEMAILS")
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "vbuzalka@its.jnj.com"
BATCH_SIZE = 200
LOG_FILE = Path(__file__).parent / "parse_emails_errors.log"
SCRIPT_VERSION = "1.1"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
    filename=str(LOG_FILE),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)
# ─── Helper functions ─────────────────────────────────────────────────────────
def safe(obj, *attrs, default=None):
    """Safe attribute read: returns the first value that is neither None nor a blank string."""
    for attr in attrs:
        try:
            val = getattr(obj, attr, None)
            if val is None:
                continue
            if isinstance(val, str) and not val.strip():
                continue
            return val
        except Exception:
            continue
    return default


def parse_date(raw) -> Optional[datetime]:
    """Any date value -> UTC datetime without tzinfo (for MongoDB)."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


def to_bson(val):
    """Converts a value to a BSON-serializable type."""
    if isinstance(val, bytes):
        # Short byte strings are stored as hex, long ones only as a size marker
        return val.hex() if len(val) <= 128 else f"<bytes:{len(val)}>"
    if isinstance(val, datetime):
        return parse_date(val)
    if isinstance(val, (str, int, float, bool, type(None))):
        return val
    if isinstance(val, list):
        return [to_bson(v) for v in val]
    try:
        return int(val)
    except Exception:
        pass
    return str(val)
# ─── Extraction of message parts ──────────────────────────────────────────────
def extract_headers(msg) -> dict:
    headers = {}
    try:
        hdr = msg.header
        if not hdr:
            return {}
        from email.header import decode_header as _dh

        def _decode(v: str) -> str:
            # Decode RFC 2047 encoded words; fall back to the raw value on failure
            try:
                parts = _dh(v)
                out = ""
                for part, enc in parts:
                    out += part.decode(enc or "utf-8", errors="replace") if isinstance(part, bytes) else part
                return out
            except Exception:
                return v

        for key in set(hdr.keys()):
            k = key.lower().replace("-", "_")
            vals = [_decode(v) for v in hdr.get_all(key, [])]
            # Repeated headers (e.g. Received) are kept as a list
            headers[k] = vals if len(vals) > 1 else (vals[0] if vals else "")
    except Exception as e:
        logging.error("extract_headers: %s", e)
    return headers


def extract_recipients(msg) -> list:
    result = []
    type_map = {1: "to", 2: "cc", 3: "bcc"}
    try:
        for r in msg.recipients:
            rtype = getattr(r, "type", 1)
            # The type may be an int or an enum; default to "to" if neither works
            try:
                rtype = int(rtype)
            except Exception:
                try:
                    rtype = int(rtype.value)
                except Exception:
                    rtype = 1
            rec = {
                "type": type_map.get(rtype, "to"),
                "email": safe(r, "email", default=""),
                "name": safe(r, "name", default=""),
            }
            result.append(rec)
    except Exception as e:
        logging.error("extract_recipients: %s", e)
    return result


def extract_attachments(msg) -> list:
    result = []
    try:
        for att in msg.attachments:
            fname = safe(att, "longFilename", "shortFilename", default="")
            if not fname:
                continue
            size = 0
            try:
                d = att.data
                size = len(d) if d else 0
            except Exception:
                pass
            result.append({
                "filename": fname,
                "size_bytes": size,
                "mime_type": safe(att, "mimetype", "mimeType", default="application/octet-stream"),
                "content_id": safe(att, "cid", default=None),
                "is_inline": bool(safe(att, "isInline", default=False)),
            })
    except Exception as e:
        logging.error("extract_attachments: %s", e)
    return result


def extract_mapi_props(msg) -> dict:
    """All raw MAPI properties as {0xXXXX: value}."""
    result = {}
    try:
        props = msg.props
        if not hasattr(props, "items"):
            return {}
        for key, prop in props.items():
            try:
                val = to_bson(prop.value)
                prop_id = f"0x{key[:4].upper()}" if len(key) >= 4 else f"0x{key.upper()}"
                result[prop_id] = val
            except Exception:
                pass
    except Exception as e:
        logging.error("extract_mapi_props: %s", e)
    return result
# ─── Main extraction ──────────────────────────────────────────────────────────
def extract_message(msg_path: Path) -> Optional[dict]:
    """Parses one .msg file -> MongoDB document."""
    try:
        msg = extract_msg.Message(str(msg_path))
    except Exception as e:
        logging.error("open failed [%s]: %s", msg_path.name, e)
        return None
    try:
        # ── Message-ID ────────────────────────────────────────────────
        mid = None
        for attr in ("messageId", "message_id", "internetMessageId"):
            mid = safe(msg, attr)
            if mid:
                break
        if not mid:
            mid = f"filename:{msg_path.stem}"
        mid = str(mid).strip()
        # ── Subject ───────────────────────────────────────────────────
        try:
            subject = msg.subject or ""
        except Exception:
            subject = ""
        normalized_subject = safe(msg, "normalizedSubject", "normalized_subject", default="")
        # ── Body ──────────────────────────────────────────────────────
        try:
            body_text = msg.body or ""
        except Exception:
            body_text = ""
        body_html = None
        try:
            bh = msg.htmlBody
            if isinstance(bh, bytes):
                bh = bh.decode("utf-8", errors="replace")
            if bh:
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
        except Exception:
            pass
        # ── Sender ────────────────────────────────────────────────────
        try:
            sender_email = msg.sender or ""
        except Exception:
            sender_email = ""
        sender_name = safe(msg, "senderName", "sender_name", default="")
        sender_smtp = safe(msg, "senderSmtpAddress", "sent_representing_smtp_address", default="")
        # ── Recipients ────────────────────────────────────────────────
        recipients = extract_recipients(msg)
        try:
            to_raw = msg.to or ""
        except Exception:
            to_raw = ""
        try:
            cc_raw = msg.cc or ""
        except Exception:
            cc_raw = ""
        try:
            bcc_raw = getattr(msg, "bcc", None) or ""
        except Exception:
            bcc_raw = ""
        display_to = safe(msg, "displayTo", "display_to", default="")
        display_cc = safe(msg, "displayCc", "display_cc", default="")
        # ── Times ─────────────────────────────────────────────────────
        try:
            received_at = parse_date(msg.date)
        except Exception:
            received_at = None
        sent_at = None
        for attr in ("clientSubmitTime", "client_submit_time", "sentOn"):
            v = safe(msg, attr)
            if v:
                sent_at = parse_date(v)
                break
        # ── MAPI properties ───────────────────────────────────────────
        importance = 1
        try:
            v = msg.importance
            if v is not None:
                importance = int(v)
        except Exception:
            pass
        sensitivity = 0
        try:
            v = getattr(msg, "sensitivity", None)
            if v is not None:
                sensitivity = int(v)
        except Exception:
            pass
        flag_status = 0
        try:
            v = safe(msg, "flagStatus", "flag_status")
            if v is not None:
                flag_status = int(v)
        except Exception:
            pass
        conversation_topic = safe(msg, "conversationTopic", "conversation_topic", default="")
        conversation_index = ""
        try:
            ci = safe(msg, "conversationIndex", "conversation_index")
            if isinstance(ci, bytes):
                conversation_index = base64.b64encode(ci).decode()
            elif ci:
                conversation_index = str(ci)
        except Exception:
            pass
        in_reply_to = safe(msg, "inReplyTo", "in_reply_to", default="")
        internet_refs = []
        try:
            refs = safe(msg, "internetReferences", "internet_references")
            if isinstance(refs, list):
                internet_refs = refs
            elif isinstance(refs, str) and refs:
                internet_refs = [r.strip() for r in refs.split() if r.strip()]
        except Exception:
            pass
        categories = []
        try:
            cats = safe(msg, "categories")
            if isinstance(cats, list):
                categories = [str(c) for c in cats if c]
            elif isinstance(cats, str) and cats:
                categories = [c.strip() for c in re.split(r"[;,]", cats) if c.strip()]
        except Exception:
            pass
        read_receipt = bool(safe(msg, "readReceiptRequested", "read_receipt_requested", default=False))
        delivery_receipt = bool(safe(msg, "deliveryReceiptRequested", "delivery_receipt_requested", default=False))
        # ── Internet headers ──────────────────────────────────────────
        headers = extract_headers(msg)
        # Fall back to the raw headers when the MAPI properties are empty
        if not in_reply_to:
            in_reply_to = headers.get("in_reply_to", "")
        if not internet_refs:
            refs_str = headers.get("references", "")
            if isinstance(refs_str, str) and refs_str:
                internet_refs = [r.strip() for r in refs_str.split() if r.strip()]
        # ── Attachments ───────────────────────────────────────────────
        attachments = extract_attachments(msg)
        # ── Raw MAPI ──────────────────────────────────────────────────
        mapi_raw = extract_mapi_props(msg)
        # ── Document ──────────────────────────────────────────────────
        return {
            "_id": mid,
            "filename": msg_path.name,
            "subject": subject,
            "normalized_subject": normalized_subject,
            "importance": importance,
            "sensitivity": sensitivity,
            "flag_status": flag_status,
            "read_receipt_requested": read_receipt,
            "delivery_receipt_requested": delivery_receipt,
            "has_attachments": len(attachments) > 0,
            "attachment_count": len(attachments),
            "message_size_bytes": msg_path.stat().st_size,
            "conversation_topic": conversation_topic,
            "conversation_index": conversation_index,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "categories": categories,
            "received_at": received_at,
            "sent_at": sent_at,
            "sender": {
                "email": sender_email,
                "name": sender_name,
                "smtp": sender_smtp,
            },
            "to": to_raw,
            "cc": cc_raw,
            "bcc": bcc_raw,
            "display_to": display_to,
            "display_cc": display_cc,
            "recipients": recipients,
            "body_text": body_text,
            "body_html": body_html,
            "attachments": attachments,
            "headers": headers,
            "mapi": mapi_raw,
            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }
    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg_path.name, e)
        return None
    finally:
        # Release the file handle on both the success and the failure path
        try:
            msg.close()
        except Exception:
            pass
# ─── MongoDB indexes ──────────────────────────────────────────────────────────
def create_indexes(col):
    print("  Creating indexes...")
    col.create_index([("received_at", ASCENDING)])
    col.create_index([("sent_at", ASCENDING)])
    col.create_index([("sender.email", ASCENDING)])
    col.create_index([("filename", ASCENDING)], unique=True, sparse=True)
    col.create_index([("conversation_topic", ASCENDING)])
    col.create_index([("has_attachments", ASCENDING)])
    col.create_index([("categories", ASCENDING)])
    col.create_index([("importance", ASCENDING)])
    col.create_index([("flag_status", ASCENDING)])
    col.create_index([
        ("subject", TEXT),
        ("body_text", TEXT),
        ("to", TEXT),
        ("cc", TEXT),
    ], name="text_search", default_language="none")
    print("  Indexes done.")
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description=f"parse_emails v{SCRIPT_VERSION}")
    ap.add_argument("--msgs-dir", default=str(MSGS_DIR),
                    help="Path to the .msg files")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N files (0 = all)")
    ap.add_argument("--skip-existing", action="store_true",
                    help="Skip files that are already in MongoDB (resume)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes at the end")
    args = ap.parse_args()
    msgs_dir = Path(args.msgs_dir)
    start = datetime.now()
    print(f"=== parse_emails v{SCRIPT_VERSION} ===")
    print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Source: {msgs_dir}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
    # MongoDB
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print("  MongoDB OK")
    except Exception as e:
        print(f"  ERROR: MongoDB is not reachable -- {e}")
        sys.exit(1)
    col = client[MONGO_DB][MONGO_COL]
    # Skip existing: load the list of files that were already imported
    existing: set = set()
    if args.skip_existing:
        print("  Loading existing records from MongoDB...")
        existing = set(col.distinct("filename"))
        print(f"  {len(existing)} already imported")
    # Scan
    print(f"\nScanning {msgs_dir} ...")
    all_files = sorted(msgs_dir.glob("*.msg"))
    if args.limit:
        all_files = all_files[:args.limit]
    to_process = [f for f in all_files if f.name not in existing]
    skipped = len(all_files) - len(to_process)
    total = len(to_process)
    print(f"  Total .msg: {len(all_files)}")
    print(f"  Skipped: {skipped}")
    print(f"  To process: {total}\n")
    if total == 0:
        print("Nothing to import.")
        client.close()
        return
    batch = []
    ok_count = 0
    err_count = 0

    def flush():
        # Write the pending upserts in one bulk_write and empty the batch
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            logging.error("bulk_write: %s", e)
            print(f"  ERROR bulk_write: {e}")
        batch.clear()

    for i, msg_path in enumerate(to_process, 1):
        doc = extract_message(msg_path)
        if doc is None:
            err_count += 1
        else:
            batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
            ok_count += 1
        if len(batch) >= BATCH_SIZE:
            flush()
        # Print one line per email
        status = "ERR " if doc is None else "OK "
        subject_str = (doc.get("subject") or "")[:60] if doc else "?"
        sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
        print(f" {i:>6}/{total} {status} {subject_str:<60} {sender_str}")
        if i % 500 == 0:
            elapsed = (datetime.now() - start).total_seconds()
            rate = i / elapsed if elapsed > 0 else 0
            eta_s = int((total - i) / rate) if rate > 0 else 0
            print(f"  {'─'*80}")
            print(f"  Progress: ok={ok_count} err={err_count} "
                  f"{rate:.1f} msg/s ETA {eta_s//3600}h{(eta_s%3600)//60}m")
            print(f"  {'─'*80}")
    flush()
    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Result: ok={ok_count} | skip={skipped} | err={err_count}")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Documents in collection: {col.count_documents({})}")
    if not args.no_indexes:
        print()
        create_indexes(col)
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
+122
@@ -0,0 +1,122 @@
# python-runner: Docker container on Tower
## Basic info
| Parameter      | Value                                        |
|----------------|----------------------------------------------|
| Name           | python-runner                                |
| Image          | python-runner (custom)                       |
| Status         | running (unless-stopped)                     |
| Python         | 3.12.13                                      |
| Start command  | `tail -f /dev/null`; the container only idles, scripts are started manually |
| Working dir    | `/scripts`                                   |
| Created        | 2026-06-02                                   |
---
## Tower: SSH access
| Parameter | Value |
|----------|------------------|
| Host | tower / 192.168.1.76 |
| Port | 22 |
| User | root |
| Heslo | 7309208104 |
**Connecting via Python (paramiko)**, because the Docker CLI is not available locally:
```python
import paramiko
c = paramiko.SSHClient()
c.set_missing_host_key_policy(paramiko.AutoAddPolicy())
c.connect('192.168.1.76', username='root', password='7309208104')
_, out, _ = c.exec_command('...')
print(out.read().decode())
c.close()
```
---
## Volume mounts
| Host (Unraid)         | Container         | Description                  |
|-----------------------|-------------------|------------------------------|
| `/mnt/user/Scripts`   | `/scripts`        | Scripts, logs; working dir   |
| `/mnt/user/JNJEMAILS` | `/mnt/JNJEMAILS`  | .msg email files (JNJ)       |
---
## Running scripts
```bash
# Interactively (output is visible):
docker exec -it python-runner python /scripts/parse_emails_tower_v1.1.py --limit 50 --no-indexes
# In the background (log to a file):
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.1.py > /scripts/parse_emails.log 2>&1"
# Resume after an interruption (skips finished files):
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.1.py --skip-existing > /scripts/parse_emails.log 2>&1"
# Watch the progress:
docker exec -it python-runner tail -f /scripts/parse_emails.log
```
---
## Current scripts in /scripts
| File                          | Description                                    |
|-------------------------------|------------------------------------------------|
| `parse_emails_tower_v1.1.py`  | Import .msg → MongoDB (db: emaily, collection: vbuzalka@its.jnj.com) |
| `parse_emails_tower_v1.1.md`  | Script documentation |
| `parse_emails.log`            | Import progress log |
| `parse_emails_errors.log`     | Error log (files that failed) |
Local counterpart: `EmailsImport/parse_emails_v1.0.py`. The code is identical; it differs only in the path
(`\\tower\JNJEMAILS` over SMB vs. the local mount `/mnt/JNJEMAILS`) and in the version in the header.
---
## Installed Python packages
```
extract-msg 0.55.0
pymongo 4.17.0
python-dateutil 2.9.0.post0
cryptography 48.0.0
beautifulsoup4 4.13.5
oletools 0.60.2
msoffcrypto-tool 6.0.0
olefile 0.47
RTFDE 0.1.2.2
compressed-rtf 1.0.7
lark 1.3.1
pcodedmp 1.2.6
tzlocal 5.3.1
six 1.17.0
pip 25.0.1
```
---
## Adding a new package
```bash
docker exec python-runner pip install <package>
```
> Note: the installation is lost when the container is recreated, so the package must also be added to the Dockerfile or to a setup script.
---
## parse_emails logic (both scripts)
- Reads all `.msg` files from MSGS_DIR
- Extracts: subject, sender, recipients (To/CC/BCC), body (text + HTML), attachments, internet headers, all raw MAPI properties
- Stores into MongoDB: database `emaily`, collection `vbuzalka@its.jnj.com`
- `_id` = Internet Message-ID (or `filename:<stem>` as a fallback)
- Upsert makes re-runs safe; `--skip-existing` resumes an interrupted import
- Indexes: received_at, sent_at, sender.email, filename (unique), full-text (subject+body+to+cc)
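
The `_id` fallback and the `--skip-existing` filter from the list above can be sketched in isolation. This is a simplified illustration, not the script's exact code; `choose_id` and `pending_files` are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, Optional


def choose_id(message_id: Optional[str], msg_path: Path) -> str:
    """Internet Message-ID when present, otherwise "filename:<stem>"."""
    if message_id and message_id.strip():
        return message_id.strip()
    return f"filename:{msg_path.stem}"


def pending_files(all_files: Iterable[Path], imported_names: set[str]) -> list[Path]:
    """Files whose name is not yet stored in the collection (resume mode)."""
    return [p for p in all_files if p.name not in imported_names]


if __name__ == "__main__":
    files = [Path("/mnt/JNJEMAILS/a.msg"), Path("/mnt/JNJEMAILS/b.msg")]
    print(choose_id(None, files[0]))
    print(pending_files(files, {"a.msg"}))
```

Because the upsert key is `_id` while resuming compares `filename`, two files carrying the same Message-ID collapse into one document, and the file whose name was overwritten is parsed again on each resume.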