2026-06-10 11:59:19 +02:00
parent a41f97b86b
commit 7b2f69ad85
275 changed files with 16726 additions and 0 deletions
@@ -0,0 +1,209 @@
"Protocol","Country","Site","PI Name","Subject ID","Age at Informed Consent","Baseline Stool Count","Confirm Baseline Stool Count","Data Correction ID","Creation Date UTC","Status","Description","Date of Last Action UTC","Total Open Period","Total Open Time (Days)","Current Status Time (Days)","Type","Next Action Required","Category","Query History","Reason for Change","Resolution"
"77242113UCO3001","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","48","1","","SW00703544","13-May-2026","Submitted","Please change answer to clinical remision from no to YES (week 12). Entry erros ","20-May-2026","15-21 Days","19","14","Query Active ","Site","New","(1) 20 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification Request.
For us to process your request, please let us know the name of the form (with date) with question.
Thank you. ERT/CLARIO Data Coordination Team
","Entry Error",""
"77242113UCO3001","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","79","1","","SW00696586","09-Apr-2026","ReadyForQC","Please correct date of endoscopy to date: 18 March 2026 (from 25 March 2026)","15-Apr-2026","Over 28 Days","41","37","Query Active ","Site","Site-Entered Data","","Entry Error","CLARIO RESOLUTION:
Part 1: In Mayo Subscore (1) dated 08 Apr 2026 for I-0 visit, CLARIO to make the following changes:
- What was the date of endoscopy? (ENDODT1D): from 25 Mar 2026 to 18 Mar 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","19","1","","SW00704536","19-May-2026","ReadyForQC","Please change the endoscopy date to 19-FEB-2026. 06-MAR-2026 was entered in error. ","26-May-2026","15-21 Days","15","10","Query Active ","Site","Site-Entered Data","","Entry Error","CLARIO RESOLUTION:
Part 1: In Mayo Subscore (1) dated 20 Mar 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 06 Mar 2026 to 19 Feb 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","22","5","Yes, I confirm this is the correct stool count.","SW00706684","01-Jun-2026","Submitted","The right endoscopy date is 23MAR2026, please change the date","05-Jun-2026","4-7 Days","7","2","Query Active ","Site","New","(1) 05 Jun 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please confirm that if you are requesting following.
Mayo Subscore (1) dated 07 Apr 2026 for I-0
What was the date of endoscopy? (ENDODT1D): from 24 Mar 2026 to 23 Mar 2026
Thank you. ERT/CLARIO Data Coordination Team.
","Entry Error",""
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","29","1","","SW00705646","26-May-2026","ReadyForQC","Correct visit date I-O is 12-May-2026. All questionaries were filled on paper and entered in tablet later.
Log-in issue. ","09-Jun-2026","8-14 Days","10","1","","Clario DM","Visit Data","(1) 01 Jun 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please provide the timestamps for each of the assessments if you used paper forms and transcribed into the device.
If unknown, ERT will use a dummy timestamp.
Thank you. ERT/CLARIO Data Coordination Team.
(2) 01 Jun 2026 dstepek@vnbrno.cz (Site User): time is unknown
","Changed Information","CLARIO RESOLUTION:
Part 1: In the following forms for I-0, CLARIO to make the following changes:
-Report Date: from 26May 2026 to 12 May 2026
-Report Start Date and time: from 26 May 2026 to 12 May 2026 23:59:59
-Event End Date: from 26 May 2026 08:27:57 to 12 May 2026 23:59:59
+Tablet Training Module (1)
+Participant Start Instructions (1)
+IBDQ (1)
+PROMIS Fatigue – Short Form 7a (1)
+BASDAI (1)
+Participant End Instructions (1)
+Visit End (122)
"
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","49","1","","SW00708623","10-Jun-2026","Submitted","Correct date of I-2 is 26.5.2026. all questionaries were entered on paper at 07,45 and transmited later. ","10-Jun-2026","1 Day","1","","","Clario DM","New","","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","49","1","","SW00706581","29-May-2026","ReadyForQC","baseline stool count reported by subject is 0, please change to 1 as per CRA request (subject has 1 stool in 2-3 days if in remission)","05-Jun-2026","4-7 Days","7","3","","Clario DM","Demographic","","Changed Information","CLARIO RESOLUTION:
Part 1: In System Variables form, CLARIO to make the following changes:
- Baseline Stool Count (PT.Custom4): from 0 to 1
"
"77242113UCO3001","Czech Republic","DD5-CZ10016","Robert Mudr","CZ100162001","48","1","","SW00705916","27-May-2026","ReadyForQC","As per ATS investigation (ATS26040111), please remove the below form which was entered as a duplicate
- MAYO Diary (5) 24 Apr 2026","05-Jun-2026","8-14 Days","9","3","","Clario DM","Technical Revision","","Technical Revision - Other","CLARIO RESOLUTION:
Part 1: CLARIO to delete MAYO Diary (5) dated 24 Apr 2026
"
"77242113UCO3001","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","15","1","","SW00701729","06-May-2026","Completed","Dears, please delete data from visit I-0 (reported as 4th of May 2026) as this visit had to be postponed - see the previous DCR of this patient and change data request that was corrected. Patient has left the site before it was resolved and and new date of I-0 was planned. Patient continues to fill in his diary and patient is coming to I=0 visit within allowed window. We need the system and tablet to be ready to run new Mayo Score Report with updated and recent data (e.g. reflect new I-0 visit date, new eligible days -1 to -7.).
thank you, Jiri Skopek","19-May-2026","8-14 Days","8","","","","Visit Data","(1) 11 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please note that the delete forms are allowed if the reason is one of the following.
If not, forms will move to unscheduled visit.
Data collected by the wrong patient.
Data collected by someone other than the patient.
Data collected prior to informed consent, or after withdrawal from the study.
Duplicate data erroneously entered at an Unscheduled visit via paper transcription.
Data collected that is not expected per protocol.
Also, I-0 visit is still ongoing. Please close the visit.
Once the visit was closed, we will process accoridngly.
Thank you. ERT/CLARIO Data Coordination Team
(2) 11 May 2026 jskopek (Site User): Dears,
I do not see any option that is adequate -from the list. Data are not needed to be deleted fully, they reflect the situation at May4th. Please mark it as unscheduled visit - as exactly that is the case. We need the system to be ready for I-0 visit planned for next week.
I will close the visit tomorrow - do you mean in tablet/ipad?
Thank you very much for your help! Jiri
(3) 12 May 2026 venkata.ramana (Clario): Thank you for your response.
Please note that the visit I-0 was still ongoing but not closed yet.
So please close the visit.
Kind Regards, Clario Data Coordination Team.
(4) 12 May 2026 jskopek (Site User): If I try to close the I-O visit in TABLET, it asks me if patient fulfils eligibility criteria to proceed to next visit based on these old data – if I answer NO, it asks me to DEACTIVATE patient. I do not want to DEACTIVATE patient – can you help WHERE and HOW to close this visit for you to change it to UNSCHEDULED and not to de-activate patient?
Thank you Jiri
","Other-delete visit I-0","CLARIO RESOLUTION:
Part 1: In the following forms dated 04 May 2026, CLARIO to make the following changes:
-Event ID: from I-0 to Unscheduled Visit 1
-Event At Entry: from I-0 to Unscheduled Visit 1
+Visit Start (49)
+ePRO Availability (1)
+Mayo Subscore (1)
+PGA (1)
Part 2: CLARIO to delete the following forms dated 04 May 2026 for I-0 visit.
+C-SSRS Since Last Visit (1)
+C-SSRS Since Last Visit Findings Report (1)
Part 3: CLARIO to manually enter Visit End form for Unscheduled visit 1 with the following information:
-Protocol: 77242113UCO3001
-Report Date: 04 May 2026
-Report Start Date and Time: 04 May 2026 23:59:59
-Event ID: Unscheduled Visit 1
-Event End Date: 04 May 2026 23:59:59
-Visit Status: Incomplete
-Phase At Entry: Screening
-Phase At Entry Timestamp: 13 Apr 2026 12:32:20
-Event At Entry: Unscheduled visit 1
-Event Start Date: 04 May 2026 23:59:59
-Event Time Zone Offset in Milliseconds: 7200000
-Session Repeat Number (SESREP1N): 0
-Session Instance Id (SESINST1S): 3f1214f0-4788-11f1-a0cf-bb403212adce
"
"77242113UCO3001","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","15","1","","SW00701226","04-May-2026","Completed","Dears, we would like ask you to change the information I read on assignment form given by patient on April 13, 2026 (Visit 1), Baseline Stool Count (PT.Custom4) as 3 that should be reported as 1.
Patient has entered wrong number as he did not understood it should be number of stools when illness is in remission or absent. He is a child and did not reflected this question correctly. Therefore, please change Baseline Stool Count = 1.
Thank you, Jiri Skopek ","04-May-2026","1 Day","1","","","","Demographic","","Changed Information","(Clario instructions)
1. Please make below changes in the assignment form:
Baseline Stool Count (PT. Custom4): 03 to 01."
"77242113UCO3001","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","61","1","","SW00699492","23-Apr-2026","ReadyForQC","Please correct the date of endoscopy done during screening visit of patient CZ100212001 to correct date 16-MAR-2026.","29-Apr-2026","Over 28 Days","32","28","Query Active ","Site","Site-Entered Data","","Changed Information","CLARIO RESOLUTION:
Part 1: In the Mayo Subscore (1) dated 07 Apr 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 24 Mar 2026 to 16 Mar 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","39","1","","SW00703322","12-May-2026","Completed","As per ATS investigation (ATS26040111), please remove the below form that's been entered as a duplicate
- MAYO Diary (16) - 18 Mar 2026
","20-May-2026","4-7 Days","6","","","","Technical Revision","","Technical Revision - Other","CLARIO RESOLUTION:
Part 1: CLARIO to delete the MAYO Diary (16) dated 18 Mar 2026.
"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","39","1","","SW00689748","09-Mar-2026","Completed","Dear all,
Patient CZ 100222003 was randomized on 9 Mar 2026. Kindly correct the colonoscopy date to 11 Feb 2025.
The date was initially entered as 21 Feb 2025 because the earlier date could not be entered in the system. The patient was rescreened.","02-Apr-2026","15-21 Days","17","","","","Site-Entered Data","(1) 13 Mar 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Could you please conform that if you are requesting following?
Mayo Subscore (1) dated 09 Mar 2026 for I-0 visit
-What was the date of endoscopy? (ENDODT1D): from 23 Feb 2026 to 11 Feb 2025
Could you please confirm the year? This subject was assigned on 02 Mar 2026, you are providing that correct date is 11 Feb 2025 which a year ago.
If you are not requesting above, please provide us the name of the form with question.
Thank you. ERT/CLARIO Data Coordination Team
(2) 13 Mar 2026 katerina.havlikova@clinoxus.com (Site User): confirm date of colonoscopy 11Feb2026
(3) 21 Mar 2026 msullivan (Clario): Dear Site,
The requested changes to the Mayo data have been updated. Please navigate to the Mayo Score Report and resubmit the form for visit to log the updated Mayo Score form. Once done, please respond to this query confirming that the Mayo Score has been resubmitted.
Thank you. ERT/CLARIO Data Coordination Team
(4) 24 Mar 2026 jana.pomahacova@clinoxus.com (Site User): Thank you and sent
","New Information","CLARIO RESOLUTION:
Part 1: In the Mayo Subscore (1) dated 09 Mar 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 23 Feb 2026 to 11 Feb 2025
-Data Flag (QSDFLG1B): from blank to check"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","33","1","","SW00705372","22-May-2026","Submitted","Dear all, please change Colonoscopz date from 8April2026 to date 01Apr2026 Thank you in advance","02-Jun-2026","8-14 Days","12","5","","Clario DM","New","(1) 29 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please provide us the name of the form for this request.
Thank you. ERT/CLARIO Data Coordination Team
(2) 02 Jun 2026 katerina.havlikova@clinoxus.com (Site User): Dear all, please change Colonoscopy for Week I-12 date from 8April2026 to date 01Apr2026 Thank you in advance
","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","33","1","","SW00702538","08-May-2026","Completed","This TRR is to document the correction to the Mayo Subscore (1) form, where the following variables were populated with NULL values, due to a known core defect:
Event At Entry, Event Start Date, Event Time Zone Offset in Milliseconds.","12-May-2026","2-3 Days","2","","","","Technical Revision","","Technical Revision - Other","Please make the below changes in Mayo Subscore (1) dated 22 Apr 2026:
-Event At Entry: I-0
-Event Start Date: 09 Apr 2026 08:09:19
-Event Time Zone Offset in Milliseconds: 7200000"
@@ -0,0 +1,53 @@
"Protocol","Study Population","Country","Site","Principal Investigator","Participant ID","Baseline Stool Frequency","Visit","Visit Date","Endoscopy Completed?","Endoscopy Date","Bowel Preparation Start Date 1","Bowel Preparation End Date 1","Bowel Preparation Start Date 2","Bowel Preparation End Date 2","Central Endoscopy Score","Local Endoscopy Score","PGA Score","Eligible Day (-1)","Day (-1) Excluded Reason(s)","Eligible Day (-2)","Day (-2) Excluded Reason(s)","Eligible Day (-3)","Day (-3) Excluded Reason(s)","Eligible Day (-4)","Day (-4) Excluded Reason(s)","Eligible Day (-5)","Day (-5) Excluded Reason(s)","Eligible Day (-6)","Day (-6) Excluded Reason(s)","Eligible Day (-7)","Day (-7) Excluded Reason(s)","Eligible Day (-8)","Day (-8) Excluded Reason(s)","Eligible Day (-9)","Day (-9) Excluded Reason(s)","Eligible Day (-10)","Day (-10) Excluded Reason(s)","Eligible Day (-1) Stool Count","Eligible Day (-2) Stool Count","Eligible Day (-3) Stool Count","Eligible Day (-4) Stool Count","Eligible Day (-5) Stool Count","Eligible Day (-6) Stool Count","Eligible Day (-7) Stool Count","Eligible Day (-8) Stool Count","Eligible Day (-9) Stool Count","Eligible Day (-10) Stool Count","Stool Frequency Sub-score","Eligible Day (-1) Rectal Bleeding Score","Eligible Day (-2) Rectal Bleeding Score","Eligible Day (-3) Rectal Bleeding Score","Eligible Day (-4) Rectal Bleeding Score","Eligible Day (-5) Rectal Bleeding Score","Eligible Day (-6) Rectal Bleeding Score","Eligible Day (-7) Rectal Bleeding Score","Eligible Day (-8) Rectal Bleeding Score","Eligible Day (-9) Rectal Bleeding Score","Eligible Day (-10) Rectal Bleeding Score","Rectal Bleeding Sub-score","Partial Mayo Score","Modified Mayo Score","Full Mayo Score","Site Action","Last Mayo Score Submission","Week I-12 Clinical Responder","Week I-12 Clinical Remission","Clinical Flare","Loss of Response","Partial Mayo Response Post Loss of Response","Partial Mayo Response for Clinical Non-Responders"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-0","19 Feb 2026","Yes","05 Feb 2026","04 Feb 2026","04 Feb 2026","-","-","2","-","3","18 Feb 2026","-","17 Feb 2026","-","16 Feb 2026","-","15 Feb 2026","-","14 Feb 2026","-","13 Feb 2026","-","12 Feb 2026","-","11 Feb 2026","Day Not Applicable for Calculation","10 Feb 2026","Day Not Applicable for Calculation","09 Feb 2026","Day Not Applicable for Calculation","10","8","7","5","7","8","8","-","-","-","3","1","1","1","0","1","1","1","-","-","-","1","7","6","9","-","08 Apr 2026 07:11:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-2","04 Mar 2026","-","-","-","-","-","-","-","-","3","03 Mar 2026","-","02 Mar 2026","-","01 Mar 2026","-","28 Feb 2026","-","27 Feb 2026","-","26 Feb 2026","-","25 Feb 2026","-","24 Feb 2026","Day Not Applicable for Calculation","23 Feb 2026","Day Not Applicable for Calculation","22 Feb 2026","Day Not Applicable for Calculation","5","4","5","4","5","6","6","-","-","-","2","1","0","1","0","1","0","1","-","-","-","1","6","","","-","28 May 2026 10:04:05","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-4","18 Mar 2026","-","-","-","-","-","-","-","-","2","17 Mar 2026","-","16 Mar 2026","-","15 Mar 2026","-","14 Mar 2026","-","13 Mar 2026","-","12 Mar 2026","-","11 Mar 2026","-","10 Mar 2026","Day Not Applicable for Calculation","09 Mar 2026","Day Not Applicable for Calculation","08 Mar 2026","Day Not Applicable for Calculation","5","5","5","4","5","4","5","-","-","-","2","1","0","0","1","1","1","0","-","-","-","1","5","","","-","08 Apr 2026 11:04:49","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-8","05 May 2026","-","-","-","-","-","-","-","-","1","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","3","3","4","4","5","4","4","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","4","","","-","28 May 2026 14:42:53","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-12","13 May 2026","Yes","06 May 2026","05 May 2026","05 May 2026","-","-","1","-","1","12 May 2026","-","11 May 2026","-","10 May 2026","-","09 May 2026","-","08 May 2026","-","07 May 2026","-","06 May 2026","Endoscopy","05 May 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","04 May 2026","-","03 May 2026","Day Not Applicable for Calculation","5","4","6","5","5","5","-","-","3","-","2","1","0","1","1","1","1","-","-","1","-","1","4","4","5","-","10 Jun 2026 07:16:05","Clinical Responder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","M-4","10 Jun 2026","-","-","-","-","-","-","-","-","1","09 Jun 2026","-","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","4","5","3","4","5","4","5","-","-","-","2","0","0","0","0","1","0","1","-","-","-","0","3","","","-","10 Jun 2026 07:15:50","N/A","N/A","No","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-0","08 Apr 2026","Yes","18 Mar 2026","17 Mar 2026","18 Mar 2026","-","-","2","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","Missing Diary","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","3","3","4","-","3","3","4","-","-","-","1","0","0","0","-","0","0","1","-","-","-","0","3","3","5","-","10 Jun 2026 08:42:08","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-2","23 Apr 2026","-","-","-","-","-","-","-","-","2","22 Apr 2026","Missing Diary","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","Day Not Applicable for Calculation","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","-","3","3","6","5","5","4","-","-","-","2","-","0","0","1","1","1","1","-","-","-","1","5","","","-","10 Jun 2026 08:42:33","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-4","06 May 2026","-","-","-","-","-","-","-","-","1","05 May 2026","-","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","Day Not Applicable for Calculation","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","6","3","2","3","3","3","3","-","-","-","1","1","0","0","0","1","1","0","-","-","-","0","2","","","-","28 May 2026 14:43:38","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-8","04 Jun 2026","-","-","-","-","-","-","-","-","1","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","Day Not Applicable for Calculation","26 May 2026","Day Not Applicable for Calculation","25 May 2026","Day Not Applicable for Calculation","3","4","3","3","3","3","4","-","-","-","1","0","0","0","0","0","0","1","-","-","-","0","2","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012003","1","I-0","27 May 2026","Yes","13 May 2026","12 May 2026","12 May 2026","-","-","3","-","2","26 May 2026","-","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","Day Not Applicable for Calculation","18 May 2026","Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","6","9","7","8","9","7","8","-","-","-","3","2","2","2","2","1","1","1","-","-","-","2","7","8","10","-","27 May 2026 07:24:39","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012003","1","I-2","10 Jun 2026","-","-","-","-","-","-","-","-","2","09 Jun 2026","-","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","7","8","8","7","6","8","6","-","-","-","3","2","2","1","2","2","2","1","-","-","-","2","7","","","-","10 Jun 2026 07:30:18","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10003","Leksa Vaclav","CZ100032001","2","I-0","10 Jun 2026","Yes","27 May 2026","26 May 2026","26 May 2026","-","-","2","-","2","09 Jun 2026","Missing Diary","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","-","4","4","4","5","4","5","-","-","-","1","-","2","2","2","2","2","2","-","-","-","2","5","5","7","-","10 Jun 2026 08:48:09","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-0","20 Mar 2026","Yes","19 Feb 2026","-","-","-","-","3","-","3","19 Mar 2026","-","18 Mar 2026","-","17 Mar 2026","-","16 Mar 2026","-","15 Mar 2026","-","14 Mar 2026","-","13 Mar 2026","-","12 Mar 2026","Day Not Applicable for Calculation","11 Mar 2026","Day Not Applicable for Calculation","10 Mar 2026","Day Not Applicable for Calculation","7","7","8","8","7","8","5","-","-","-","3","2","1","1","1","1","1","0","-","-","-","1","7","7","10","-","20 Mar 2026 07:03:23","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-2","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","Medication For Diarrhea","06 Apr 2026","Medication For Diarrhea","05 Apr 2026","Medication For Diarrhea","04 Apr 2026","Medication For Diarrhea","03 Apr 2026","Medication For Diarrhea","02 Apr 2026","Medication For Diarrhea","01 Apr 2026","Medication For Diarrhea","31 Mar 2026","Medication For Diarrhea;Day Not Applicable for Calculation","30 Mar 2026","Medication For Diarrhea;Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","-","-","-","-","-","-","-","-","-","-","Non-Evaluable","-","-","-","-","-","-","-","-","-","-","Non-Evaluable","Non-Evaluable","Non-Evaluable","Non-Evaluable","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-4","15 Apr 2026","-","-","-","-","-","-","-","-","3","14 Apr 2026","-","13 Apr 2026","-","12 Apr 2026","-","11 Apr 2026","-","10 Apr 2026","-","09 Apr 2026","-","08 Apr 2026","-","07 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","06 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","05 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","9","22","20","19","17","18","18","-","-","-","3","1","3","2","2","2","2","2","-","-","-","2","8","","","-","04 May 2026 22:06:03","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-8","18 May 2026","-","-","-","-","-","-","-","-","2","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","-","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","08 May 2026","Day Not Applicable for Calculation","7","5","9","7","7","8","8","-","-","-","3","1","1","1","1","1","1","1","-","-","-","1","6","","","-","04 Jun 2026 21:46:30","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-12","08 Jun 2026","Yes","28 May 2026","-","-","-","-","3","-","3","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","Missing Diary","31 May 2026","Day Not Applicable for Calculation","30 May 2026","Day Not Applicable for Calculation","29 May 2026","Day Not Applicable for Calculation","6","5","5","5","7","6","-","-","-","-","3","1","1","0","0","1","0","-","-","-","-","1","7","7","10","-","-","Clinical Nonresponder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062002","1","I-0","26 May 2026","Yes","14 May 2026","13 May 2026","13 May 2026","-","-","2","-","2","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","-","18 May 2026","Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","16 May 2026","Day Not Applicable for Calculation","8","8","6","7","7","6","7","-","-","-","3","2","2","2","2","2","2","2","-","-","-","2","7","7","9","-","29 May 2026 15:45:00","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062002","1","I-2","09 Jun 2026","-","-","-","-","-","-","-","-","2","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","30 May 2026","Day Not Applicable for Calculation","7","8","7","7","7","5","7","-","-","-","3","2","1","1","1","2","2","2","-","-","-","2","7","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-0","05 May 2026","Yes","24 Apr 2026","23 Apr 2026","23 Apr 2026","-","-","2","-","2","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","5","5","5","5","5","5","5","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","5","5","7","-","05 May 2026 11:19:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-2","19 May 2026","-","-","-","-","-","-","-","-","1","18 May 2026","-","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","Day Not Applicable for Calculation","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","5","4","5","5","5","4","6","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","4","","","-","19 May 2026 10:38:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-4","04 Jun 2026","-","-","-","-","-","-","-","-","1","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","Day Not Applicable for Calculation","26 May 2026","Day Not Applicable for Calculation","25 May 2026","Day Not Applicable for Calculation","2","3","2","3","3","2","3","-","-","-","1","0","0","0","0","0","0","0","-","-","-","0","2","","","-","04 Jun 2026 09:24:54","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-0","07 Apr 2026","Yes","24 Mar 2026","22 Mar 2026","22 Mar 2026","-","-","2","-","2","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","-","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","28 Mar 2026","Day Not Applicable for Calculation","8","11","5","9","11","10","13","-","-","-","3","1","2","2","2","2","2","2","-","-","-","2","7","7","9","-","04 May 2026 08:44:52","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-2","22 Apr 2026","-","-","-","-","-","-","-","-","2","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","7","5","6","6","7","8","2","-","-","-","1","1","0","1","1","1","2","0","-","-","-","1","4","","","-","04 May 2026 08:45:07","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-4","07 May 2026","-","-","-","-","-","-","-","-","1","06 May 2026","-","05 May 2026","-","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","Day Not Applicable for Calculation","28 Apr 2026","Day Not Applicable for Calculation","27 Apr 2026","Day Not Applicable for Calculation","8","7","7","8","4","11","7","-","-","-","1","2","1","1","1","0","1","1","-","-","-","1","3","","","-","01 Jun 2026 00:57:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-8","03 Jun 2026","-","-","-","-","-","-","-","-","2","02 Jun 2026","-","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","Day Not Applicable for Calculation","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","5","9","7","5","5","9","7","-","-","-","1","1","1","1","0","3","0","1","-","-","-","1","4","","","-","03 Jun 2026 17:47:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-0","24 Mar 2026","Yes","12 Mar 2026","11 Mar 2026","11 Mar 2026","-","-","2","-","2","23 Mar 2026","-","22 Mar 2026","-","21 Mar 2026","-","20 Mar 2026","-","19 Mar 2026","-","18 Mar 2026","-","17 Mar 2026","-","16 Mar 2026","Day Not Applicable for Calculation","15 Mar 2026","Day Not Applicable for Calculation","14 Mar 2026","Day Not Applicable for Calculation","8","6","5","7","6","7","6","-","-","-","3","1","1","1","0","1","1","1","-","-","-","1","6","6","8","-","05 Apr 2026 22:41:27","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-2","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","5","2","3","6","5","5","5","-","-","-","2","0","0","0","0","1","1","0","-","-","-","0","4","","","-","28 May 2026 23:19:03","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-4","21 Apr 2026","-","-","-","-","-","-","-","-","0","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","-","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","11 Apr 2026","Day Not Applicable for Calculation","4","3","4","3","3","4","4","-","-","-","2","0","0","0","0","0","0","0","-","-","-","0","2","","","-","27 May 2026 12:54:41","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","1","I-0","12 May 2026","Yes","21 Apr 2026","20 Apr 2026","21 Apr 2026","-","-","2","-","2","11 May 2026","-","10 May 2026","-","09 May 2026","-","08 May 2026","-","07 May 2026","-","06 May 2026","-","05 May 2026","Missing Diary","04 May 2026","Day Not Applicable for Calculation","03 May 2026","Day Not Applicable for Calculation","02 May 2026","Day Not Applicable for Calculation","2","1","1","1","1","2","-","-","-","-","0","0","0","0","0","0","0","-","-","-","-","0","2","2","4","-","28 May 2026 23:19:30","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","1","I-2","26 May 2026","-","-","-","-","-","-","-","-","1","25 May 2026","-","24 May 2026","Missing Diary","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","-","18 May 2026","Missing Diary;Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","16 May 2026","Day Not Applicable for Calculation","1","-","1","2","1","2","2","-","-","-","1","0","-","0","0","0","0","0","-","-","-","0","2","","","-","28 May 2026 23:19:51","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","1","I-0","02 Jun 2026","Yes","25 May 2026","24 May 2026","24 May 2026","-","-","2","-","2","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Endoscopy;Missing Diary;Day Not Applicable for Calculation","24 May 2026","Bowel Preparation for Procedure;Missing Diary;Day Not Applicable for Calculation","23 May 2026","Missing Diary;Day Not Applicable for Calculation","8","8","11","10","10","11","6","-","-","-","3","2","2","1","2","1","2","2","-","-","-","2","7","7","9","-","02 Jun 2026 08:17:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","1","I-2","10 Jun 2026","-","-","-","-","-","-","-","-","2","09 Jun 2026","-","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","9","2","1","4","2","4","2","-","-","-","1","1","1","0","1","1","1","0","-","-","-","1","4","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10016","Robert Mudr","CZ100162001","1","I-0","28 May 2026","Yes","19 May 2026","18 May 2026","19 May 2026","-","-","3","-","3","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","Day Not Applicable for Calculation","19 May 2026","Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation","18 May 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","14","15","15","15","15","15","15","-","-","-","3","2","3","3","2","2","3","3","-","-","-","3","9","9","12","-","28 May 2026 10:19:28","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","Unscheduled 1","04 May 2026","Yes","20 Apr 2026","12 Apr 2026","15 Apr 2026","-","-","2","-","3","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","-","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","24 Apr 2026","Day Not Applicable for Calculation","5","6","6","7","6","3","3","-","-","-","2","0","0","0","0","0","0","0","-","-","-","0","5","4","7","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","I-0","18 May 2026","Yes","01 May 2026","01 May 2026","01 May 2026","-","-","2","-","3","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","-","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","08 May 2026","Day Not Applicable for Calculation","6","6","6","6","6","6","6","-","-","-","3","0","0","0","0","0","0","0","-","-","-","0","6","5","8","-","18 May 2026 08:39:27","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","I-2","01 Jun 2026","-","-","-","-","-","-","-","-","3","31 May 2026","-","30 May 2026","Missing Diary","29 May 2026","Missing Diary","28 May 2026","Missing Diary","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","22 May 2026","Day Not Applicable for Calculation","6","-","-","-","6","6","6","-","-","-","3","0","-","-","-","0","0","0","-","-","-","0","6","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-0","07 Apr 2026","Yes","16 Mar 2026","15 Mar 2026","16 Mar 2026","-","-","3","-","3","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","-","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","28 Mar 2026","Day Not Applicable for Calculation","11","11","10","11","11","10","9","-","-","-","3","2","2","2","2","2","2","2","-","-","-","2","8","8","11","-","20 Apr 2026 09:27:58","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-2","20 Apr 2026","-","-","-","-","-","-","-","-","3","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","-","13 Apr 2026","-","12 Apr 2026","Day Not Applicable for Calculation","11 Apr 2026","Day Not Applicable for Calculation","10 Apr 2026","Day Not Applicable for Calculation","8","7","9","8","8","7","8","-","-","-","3","2","2","1","1","1","2","1","-","-","-","1","7","","","-","20 Apr 2026 09:29:01","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-4","05 May 2026","-","-","-","-","-","-","-","-","1","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","6","6","6","6","7","7","6","-","-","-","3","0","0","1","1","1","1","1","-","-","-","1","5","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-8","02 Jun 2026","-","-","-","-","-","-","-","-","1","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","3","4","4","4","5","5","5","-","-","-","2","0","0","0","0","0","1","1","-","-","-","0","3","","","-","02 Jun 2026 14:44:34","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222002","1","I-0","19 Feb 2026","Yes","11 Feb 2026","10 Feb 2026","11 Feb 2026","-","-","2","-","2","18 Feb 2026","-","17 Feb 2026","-","16 Feb 2026","-","15 Feb 2026","-","14 Feb 2026","-","13 Feb 2026","-","12 Feb 2026","-","11 Feb 2026","Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation","10 Feb 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","09 Feb 2026","Day Not Applicable for Calculation","3","2","2","3","4","3","2","-","-","-","1","1","1","0","0","0","2","2","-","-","-","1","4","4","6","-","19 Feb 2026 15:37:49","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-0","09 Mar 2026","Yes","11 Feb 2026","10 Feb 2026","11 Feb 2026","-","-","2","-","2","08 Mar 2026","-","07 Mar 2026","-","06 Mar 2026","-","05 Mar 2026","-","04 Mar 2026","-","03 Mar 2026","Missing Diary","02 Mar 2026","Missing Diary","01 Mar 2026","Missing Diary;Day Not Applicable for Calculation","28 Feb 2026","Missing Diary;Day Not Applicable for Calculation","27 Feb 2026","Missing Diary;Day Not Applicable for Calculation","7","7","6","6","7","-","-","-","-","-","3","2","2","2","2","2","-","-","-","-","-","2","7","7","9","-","24 Mar 2026 14:23:10","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-2","27 Mar 2026","-","-","-","-","-","-","-","-","2","26 Mar 2026","-","25 Mar 2026","-","24 Mar 2026","-","23 Mar 2026","-","22 Mar 2026","-","21 Mar 2026","-","20 Mar 2026","-","19 Mar 2026","Day Not Applicable for Calculation","18 Mar 2026","Day Not Applicable for Calculation","17 Mar 2026","Day Not Applicable for Calculation","7","3","3","3","5","5","5","-","-","-","2","0","0","1","1","1","1","2","-","-","-","1","5","","","-","08 Apr 2026 07:36:56","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-4","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","3","3","4","4","5","4","3","-","-","-","2","1","0","0","2","1","1","2","-","-","-","1","5","","","-","08 Apr 2026 07:59:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-8","04 May 2026","-","-","-","-","-","-","-","-","2","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","-","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","24 Apr 2026","Missing Diary;Day Not Applicable for Calculation","3","5","3","3","3","2","3","-","-","-","1","0","0","0","0","0","0","0","-","-","-","0","3","","","-","04 May 2026 08:08:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-12","01 Jun 2026","Yes","20 May 2026","19 May 2026","20 May 2026","-","-","3","-","2","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","22 May 2026","Day Not Applicable for Calculation","4","4","6","3","3","3","3","-","-","-","2","1","1","2","1","1","1","2","-","-","-","1","5","6","8","-","01 Jun 2026 14:25:57","Clinical Nonresponder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-0","09 Apr 2026","Yes","08 Apr 2026","31 Mar 2026","01 Apr 2026","-","-","2","-","2","08 Apr 2026","Endoscopy","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","31 Mar 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","30 Mar 2026","-","-","3","3","4","3","4","3","-","-","3","1","-","2","2","2","2","2","2","-","-","2","2","5","5","7","-","29 May 2026 11:07:08","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-2","22 Apr 2026","-","-","-","-","-","-","-","-","2","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","3","3","5","3","2","3","2","-","-","-","1","1","2","2","1","1","1","2","-","-","-","1","4","","","-","05 May 2026 07:29:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-4","05 May 2026","-","-","-","-","-","-","-","-","2","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","4","2","2","2","2","2","2","-","-","-","1","1","1","1","1","2","1","1","-","-","-","1","4","","","-","05 May 2026 07:28:55","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-8","02 Jun 2026","-","-","-","-","-","-","-","-","2","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","2","2","2","2","2","4","10","-","-","-","1","2","1","2","1","2","2","2","-","-","-","2","5","","","-","02 Jun 2026 08:18:08","N/A","N/A","N/A","N/A","N/A","N/A"
12 77242113UCO3001 Adult Czech Republic DD5-CZ10001 Matej Falc CZ100012003 1 I-0 27 May 2026 Yes 13 May 2026 12 May 2026 12 May 2026 - - 3 - 2 26 May 2026 - 25 May 2026 - 24 May 2026 - 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 - 19 May 2026 Day Not Applicable for Calculation 18 May 2026 Day Not Applicable for Calculation 17 May 2026 Day Not Applicable for Calculation 6 9 7 8 9 7 8 - - - 3 2 2 2 2 1 1 1 - - - 2 7 8 10 - 27 May 2026 07:24:39 N/A N/A N/A N/A N/A N/A
13 77242113UCO3001 Adult Czech Republic DD5-CZ10001 Matej Falc CZ100012003 1 I-2 10 Jun 2026 - - - - - - - - 2 09 Jun 2026 - 08 Jun 2026 - 07 Jun 2026 - 06 Jun 2026 - 05 Jun 2026 - 04 Jun 2026 - 03 Jun 2026 - 02 Jun 2026 Day Not Applicable for Calculation 01 Jun 2026 Day Not Applicable for Calculation 31 May 2026 Day Not Applicable for Calculation 7 8 8 7 6 8 6 - - - 3 2 2 1 2 2 2 1 - - - 2 7 - 10 Jun 2026 07:30:18 N/A N/A N/A N/A N/A N/A
14 77242113UCO3001 Adult Czech Republic DD5-CZ10003 Leksa Vaclav CZ100032001 2 I-0 10 Jun 2026 Yes 27 May 2026 26 May 2026 26 May 2026 - - 2 - 2 09 Jun 2026 Missing Diary 08 Jun 2026 - 07 Jun 2026 - 06 Jun 2026 - 05 Jun 2026 - 04 Jun 2026 - 03 Jun 2026 - 02 Jun 2026 Day Not Applicable for Calculation 01 Jun 2026 Day Not Applicable for Calculation 31 May 2026 Day Not Applicable for Calculation - 4 4 4 5 4 5 - - - 1 - 2 2 2 2 2 2 - - - 2 5 5 7 - 10 Jun 2026 08:48:09 N/A N/A N/A N/A N/A N/A
15 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-0 20 Mar 2026 Yes 19 Feb 2026 - - - - 3 - 3 19 Mar 2026 - 18 Mar 2026 - 17 Mar 2026 - 16 Mar 2026 - 15 Mar 2026 - 14 Mar 2026 - 13 Mar 2026 - 12 Mar 2026 Day Not Applicable for Calculation 11 Mar 2026 Day Not Applicable for Calculation 10 Mar 2026 Day Not Applicable for Calculation 7 7 8 8 7 8 5 - - - 3 2 1 1 1 1 1 0 - - - 1 7 7 10 - 20 Mar 2026 07:03:23 N/A N/A N/A N/A N/A N/A
16 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-2 08 Apr 2026 - - - - - - - - 2 07 Apr 2026 Medication For Diarrhea 06 Apr 2026 Medication For Diarrhea 05 Apr 2026 Medication For Diarrhea 04 Apr 2026 Medication For Diarrhea 03 Apr 2026 Medication For Diarrhea 02 Apr 2026 Medication For Diarrhea 01 Apr 2026 Medication For Diarrhea 31 Mar 2026 Medication For Diarrhea;Day Not Applicable for Calculation 30 Mar 2026 Medication For Diarrhea;Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation - - - - - - - - - - Non-Evaluable - - - - - - - - - - Non-Evaluable Non-Evaluable Non-Evaluable Non-Evaluable - - N/A N/A N/A N/A N/A N/A
17 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-4 15 Apr 2026 - - - - - - - - 3 14 Apr 2026 - 13 Apr 2026 - 12 Apr 2026 - 11 Apr 2026 - 10 Apr 2026 - 09 Apr 2026 - 08 Apr 2026 - 07 Apr 2026 Medication For Diarrhea;Day Not Applicable for Calculation 06 Apr 2026 Medication For Diarrhea;Day Not Applicable for Calculation 05 Apr 2026 Medication For Diarrhea;Day Not Applicable for Calculation 9 22 20 19 17 18 18 - - - 3 1 3 2 2 2 2 2 - - - 2 8 - 04 May 2026 22:06:03 N/A N/A N/A N/A N/A N/A
18 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-8 18 May 2026 - - - - - - - - 2 17 May 2026 - 16 May 2026 - 15 May 2026 - 14 May 2026 - 13 May 2026 - 12 May 2026 - 11 May 2026 - 10 May 2026 Day Not Applicable for Calculation 09 May 2026 Day Not Applicable for Calculation 08 May 2026 Day Not Applicable for Calculation 7 5 9 7 7 8 8 - - - 3 1 1 1 1 1 1 1 - - - 1 6 - 04 Jun 2026 21:46:30 N/A N/A N/A N/A N/A N/A
19 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062001 1 I-12 08 Jun 2026 Yes 28 May 2026 - - - - 3 - 3 07 Jun 2026 - 06 Jun 2026 - 05 Jun 2026 - 04 Jun 2026 - 03 Jun 2026 - 02 Jun 2026 - 01 Jun 2026 Missing Diary 31 May 2026 Day Not Applicable for Calculation 30 May 2026 Day Not Applicable for Calculation 29 May 2026 Day Not Applicable for Calculation 6 5 5 5 7 6 - - - - 3 1 1 0 0 1 0 - - - - 1 7 7 10 - - Clinical Nonresponder No N/A N/A N/A N/A
20 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062002 1 I-0 26 May 2026 Yes 14 May 2026 13 May 2026 13 May 2026 - - 2 - 2 25 May 2026 - 24 May 2026 - 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 - 19 May 2026 - 18 May 2026 Day Not Applicable for Calculation 17 May 2026 Day Not Applicable for Calculation 16 May 2026 Day Not Applicable for Calculation 8 8 6 7 7 6 7 - - - 3 2 2 2 2 2 2 2 - - - 2 7 7 9 - 29 May 2026 15:45:00 N/A N/A N/A N/A N/A N/A
21 77242113UCO3001 Adult Czech Republic DD5-CZ10006 Michal Konecny CZ100062002 1 I-2 09 Jun 2026 - - - - - - - - 2 08 Jun 2026 - 07 Jun 2026 - 06 Jun 2026 - 05 Jun 2026 - 04 Jun 2026 - 03 Jun 2026 - 02 Jun 2026 - 01 Jun 2026 Day Not Applicable for Calculation 31 May 2026 Day Not Applicable for Calculation 30 May 2026 Day Not Applicable for Calculation 7 8 7 7 7 5 7 - - - 3 2 1 1 1 2 2 2 - - - 2 7 - - N/A N/A N/A N/A N/A N/A
22 77242113UCO3001 Adult Czech Republic DD5-CZ10009 Jiri Pumprla CZ100092001 1 I-0 05 May 2026 Yes 24 Apr 2026 23 Apr 2026 23 Apr 2026 - - 2 - 2 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 Day Not Applicable for Calculation 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 5 5 5 5 5 5 5 - - - 2 1 1 1 1 1 1 1 - - - 1 5 5 7 - 05 May 2026 11:19:40 N/A N/A N/A N/A N/A N/A
23 77242113UCO3001 Adult Czech Republic DD5-CZ10009 Jiri Pumprla CZ100092001 1 I-2 19 May 2026 - - - - - - - - 1 18 May 2026 - 17 May 2026 - 16 May 2026 - 15 May 2026 - 14 May 2026 - 13 May 2026 - 12 May 2026 - 11 May 2026 Day Not Applicable for Calculation 10 May 2026 Day Not Applicable for Calculation 09 May 2026 Day Not Applicable for Calculation 5 4 5 5 5 4 6 - - - 2 1 1 1 1 1 1 1 - - - 1 4 - 19 May 2026 10:38:25 N/A N/A N/A N/A N/A N/A
24 77242113UCO3001 Adult Czech Republic DD5-CZ10009 Jiri Pumprla CZ100092001 1 I-4 04 Jun 2026 - - - - - - - - 1 03 Jun 2026 - 02 Jun 2026 - 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 Day Not Applicable for Calculation 26 May 2026 Day Not Applicable for Calculation 25 May 2026 Day Not Applicable for Calculation 2 3 2 3 3 2 3 - - - 1 0 0 0 0 0 0 0 - - - 0 2 - 04 Jun 2026 09:24:54 N/A N/A N/A N/A N/A N/A
25 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-0 07 Apr 2026 Yes 24 Mar 2026 22 Mar 2026 22 Mar 2026 - - 2 - 2 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 - 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 28 Mar 2026 Day Not Applicable for Calculation 8 11 5 9 11 10 13 - - - 3 1 2 2 2 2 2 2 - - - 2 7 7 9 - 04 May 2026 08:44:52 N/A N/A N/A N/A N/A N/A
26 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-2 22 Apr 2026 - - - - - - - - 2 21 Apr 2026 - 20 Apr 2026 - 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 Day Not Applicable for Calculation 13 Apr 2026 Day Not Applicable for Calculation 12 Apr 2026 Day Not Applicable for Calculation 7 5 6 6 7 8 2 - - - 1 1 0 1 1 1 2 0 - - - 1 4 - 04 May 2026 08:45:07 N/A N/A N/A N/A N/A N/A
27 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-4 07 May 2026 - - - - - - - - 1 06 May 2026 - 05 May 2026 - 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 Day Not Applicable for Calculation 28 Apr 2026 Day Not Applicable for Calculation 27 Apr 2026 Day Not Applicable for Calculation 8 7 7 8 4 11 7 - - - 1 2 1 1 1 0 1 1 - - - 1 3 - 01 Jun 2026 00:57:35 N/A N/A N/A N/A N/A N/A
28 77242113UCO3001 Adult Czech Republic DD5-CZ10012 Stefan Konecny CZ100122001 5 I-8 03 Jun 2026 - - - - - - - - 2 02 Jun 2026 - 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 Day Not Applicable for Calculation 25 May 2026 Day Not Applicable for Calculation 24 May 2026 Day Not Applicable for Calculation 5 9 7 5 5 9 7 - - - 1 1 1 1 0 3 0 1 - - - 1 4 - 03 Jun 2026 17:47:25 N/A N/A N/A N/A N/A N/A
29 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132001 1 I-0 24 Mar 2026 Yes 12 Mar 2026 11 Mar 2026 11 Mar 2026 - - 2 - 2 23 Mar 2026 - 22 Mar 2026 - 21 Mar 2026 - 20 Mar 2026 - 19 Mar 2026 - 18 Mar 2026 - 17 Mar 2026 - 16 Mar 2026 Day Not Applicable for Calculation 15 Mar 2026 Day Not Applicable for Calculation 14 Mar 2026 Day Not Applicable for Calculation 8 6 5 7 6 7 6 - - - 3 1 1 1 0 1 1 1 - - - 1 6 6 8 - 05 Apr 2026 22:41:27 N/A N/A N/A N/A N/A N/A
30 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132001 1 I-2 08 Apr 2026 - - - - - - - - 2 07 Apr 2026 - 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 Day Not Applicable for Calculation 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 5 2 3 6 5 5 5 - - - 2 0 0 0 0 1 1 0 - - - 0 4 - 28 May 2026 23:19:03 N/A N/A N/A N/A N/A N/A
31 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132001 1 I-4 21 Apr 2026 - - - - - - - - 0 20 Apr 2026 - 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 - 13 Apr 2026 Day Not Applicable for Calculation 12 Apr 2026 Day Not Applicable for Calculation 11 Apr 2026 Day Not Applicable for Calculation 4 3 4 3 3 4 4 - - - 2 0 0 0 0 0 0 0 - - - 0 2 - 27 May 2026 12:54:41 N/A N/A N/A N/A N/A N/A
32 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132002 1 I-0 12 May 2026 Yes 21 Apr 2026 20 Apr 2026 21 Apr 2026 - - 2 - 2 11 May 2026 - 10 May 2026 - 09 May 2026 - 08 May 2026 - 07 May 2026 - 06 May 2026 - 05 May 2026 Missing Diary 04 May 2026 Day Not Applicable for Calculation 03 May 2026 Day Not Applicable for Calculation 02 May 2026 Day Not Applicable for Calculation 2 1 1 1 1 2 - - - - 0 0 0 0 0 0 0 - - - - 0 2 2 4 - 28 May 2026 23:19:30 N/A N/A N/A N/A N/A N/A
33 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132002 1 I-2 26 May 2026 - - - - - - - - 1 25 May 2026 - 24 May 2026 Missing Diary 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 - 19 May 2026 - 18 May 2026 Missing Diary;Day Not Applicable for Calculation 17 May 2026 Day Not Applicable for Calculation 16 May 2026 Day Not Applicable for Calculation 1 - 1 2 1 2 2 - - - 1 0 - 0 0 0 0 0 - - - 0 2 - 28 May 2026 23:19:51 N/A N/A N/A N/A N/A N/A
34 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132003 1 I-0 02 Jun 2026 Yes 25 May 2026 24 May 2026 24 May 2026 - - 2 - 2 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 Endoscopy;Missing Diary;Day Not Applicable for Calculation 24 May 2026 Bowel Preparation for Procedure;Missing Diary;Day Not Applicable for Calculation 23 May 2026 Missing Diary;Day Not Applicable for Calculation 8 8 11 10 10 11 6 - - - 3 2 2 1 2 1 2 2 - - - 2 7 7 9 - 02 Jun 2026 08:17:40 N/A N/A N/A N/A N/A N/A
35 77242113UCO3001 Adult Czech Republic DD5-CZ10013 David Stepek CZ100132003 1 I-2 10 Jun 2026 - - - - - - - - 2 09 Jun 2026 - 08 Jun 2026 - 07 Jun 2026 - 06 Jun 2026 - 05 Jun 2026 - 04 Jun 2026 - 03 Jun 2026 - 02 Jun 2026 Day Not Applicable for Calculation 01 Jun 2026 Day Not Applicable for Calculation 31 May 2026 Day Not Applicable for Calculation 9 2 1 4 2 4 2 - - - 1 1 1 0 1 1 1 0 - - - 1 4 - - N/A N/A N/A N/A N/A N/A
36 77242113UCO3001 Adult Czech Republic DD5-CZ10016 Robert Mudr CZ100162001 1 I-0 28 May 2026 Yes 19 May 2026 18 May 2026 19 May 2026 - - 3 - 3 27 May 2026 - 26 May 2026 - 25 May 2026 - 24 May 2026 - 23 May 2026 - 22 May 2026 - 21 May 2026 - 20 May 2026 Day Not Applicable for Calculation 19 May 2026 Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation 18 May 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 14 15 15 15 15 15 15 - - - 3 2 3 3 2 2 3 3 - - - 3 9 9 12 - 28 May 2026 10:19:28 N/A N/A N/A N/A N/A N/A
37 77242113UCO3001 Adolescent Czech Republic DD5-CZ10020 Lucie Gonsorcikova CZ100201001 1 Unscheduled 1 04 May 2026 Yes 20 Apr 2026 12 Apr 2026 15 Apr 2026 - - 2 - 3 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 - 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 24 Apr 2026 Day Not Applicable for Calculation 5 6 6 7 6 3 3 - - - 2 0 0 0 0 0 0 0 - - - 0 5 4 7 - - N/A N/A N/A N/A N/A N/A
38 77242113UCO3001 Adolescent Czech Republic DD5-CZ10020 Lucie Gonsorcikova CZ100201001 1 I-0 18 May 2026 Yes 01 May 2026 01 May 2026 01 May 2026 - - 2 - 3 17 May 2026 - 16 May 2026 - 15 May 2026 - 14 May 2026 - 13 May 2026 - 12 May 2026 - 11 May 2026 - 10 May 2026 Day Not Applicable for Calculation 09 May 2026 Day Not Applicable for Calculation 08 May 2026 Day Not Applicable for Calculation 6 6 6 6 6 6 6 - - - 3 0 0 0 0 0 0 0 - - - 0 6 5 8 - 18 May 2026 08:39:27 N/A N/A N/A N/A N/A N/A
39 77242113UCO3001 Adolescent Czech Republic DD5-CZ10020 Lucie Gonsorcikova CZ100201001 1 I-2 01 Jun 2026 - - - - - - - - 3 31 May 2026 - 30 May 2026 Missing Diary 29 May 2026 Missing Diary 28 May 2026 Missing Diary 27 May 2026 - 26 May 2026 - 25 May 2026 - 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 22 May 2026 Day Not Applicable for Calculation 6 - - - 6 6 6 - - - 3 0 - - - 0 0 0 - - - 0 6 - - N/A N/A N/A N/A N/A N/A
40 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-0 07 Apr 2026 Yes 16 Mar 2026 15 Mar 2026 16 Mar 2026 - - 3 - 3 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 - 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 28 Mar 2026 Day Not Applicable for Calculation 11 11 10 11 11 10 9 - - - 3 2 2 2 2 2 2 2 - - - 2 8 8 11 - 20 Apr 2026 09:27:58 N/A N/A N/A N/A N/A N/A
41 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-2 20 Apr 2026 - - - - - - - - 3 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 - 13 Apr 2026 - 12 Apr 2026 Day Not Applicable for Calculation 11 Apr 2026 Day Not Applicable for Calculation 10 Apr 2026 Day Not Applicable for Calculation 8 7 9 8 8 7 8 - - - 3 2 2 1 1 1 2 1 - - - 1 7 - 20 Apr 2026 09:29:01 N/A N/A N/A N/A N/A N/A
42 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-4 05 May 2026 - - - - - - - - 1 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 Day Not Applicable for Calculation 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 6 6 6 6 7 7 6 - - - 3 0 0 1 1 1 1 1 - - - 1 5 - - N/A N/A N/A N/A N/A N/A
43 77242113UCO3001 Adult Czech Republic DD5-CZ10021 Martin Bortlik CZ100212001 1 I-8 02 Jun 2026 - - - - - - - - 1 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 Day Not Applicable for Calculation 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 3 4 4 4 5 5 5 - - - 2 0 0 0 0 0 1 1 - - - 0 3 - 02 Jun 2026 14:44:34 N/A N/A N/A N/A N/A N/A
44 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222002 1 I-0 19 Feb 2026 Yes 11 Feb 2026 10 Feb 2026 11 Feb 2026 - - 2 - 2 18 Feb 2026 - 17 Feb 2026 - 16 Feb 2026 - 15 Feb 2026 - 14 Feb 2026 - 13 Feb 2026 - 12 Feb 2026 - 11 Feb 2026 Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation 10 Feb 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 09 Feb 2026 Day Not Applicable for Calculation 3 2 2 3 4 3 2 - - - 1 1 1 0 0 0 2 2 - - - 1 4 4 6 - 19 Feb 2026 15:37:49 N/A N/A N/A N/A N/A N/A
45 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-0 09 Mar 2026 Yes 11 Feb 2026 10 Feb 2026 11 Feb 2026 - - 2 - 2 08 Mar 2026 - 07 Mar 2026 - 06 Mar 2026 - 05 Mar 2026 - 04 Mar 2026 - 03 Mar 2026 Missing Diary 02 Mar 2026 Missing Diary 01 Mar 2026 Missing Diary;Day Not Applicable for Calculation 28 Feb 2026 Missing Diary;Day Not Applicable for Calculation 27 Feb 2026 Missing Diary;Day Not Applicable for Calculation 7 7 6 6 7 - - - - - 3 2 2 2 2 2 - - - - - 2 7 7 9 - 24 Mar 2026 14:23:10 N/A N/A N/A N/A N/A N/A
46 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-2 27 Mar 2026 - - - - - - - - 2 26 Mar 2026 - 25 Mar 2026 - 24 Mar 2026 - 23 Mar 2026 - 22 Mar 2026 - 21 Mar 2026 - 20 Mar 2026 - 19 Mar 2026 Day Not Applicable for Calculation 18 Mar 2026 Day Not Applicable for Calculation 17 Mar 2026 Day Not Applicable for Calculation 7 3 3 3 5 5 5 - - - 2 0 0 1 1 1 1 2 - - - 1 5 - 08 Apr 2026 07:36:56 N/A N/A N/A N/A N/A N/A
47 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-4 08 Apr 2026 - - - - - - - - 2 07 Apr 2026 - 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 - 31 Mar 2026 Day Not Applicable for Calculation 30 Mar 2026 Day Not Applicable for Calculation 29 Mar 2026 Day Not Applicable for Calculation 3 3 4 4 5 4 3 - - - 2 1 0 0 2 1 1 2 - - - 1 5 - 08 Apr 2026 07:59:35 N/A N/A N/A N/A N/A N/A
48 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-8 04 May 2026 - - - - - - - - 2 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 - 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 24 Apr 2026 Missing Diary;Day Not Applicable for Calculation 3 5 3 3 3 2 3 - - - 1 0 0 0 0 0 0 0 - - - 0 3 - 04 May 2026 08:08:40 N/A N/A N/A N/A N/A N/A
49 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222003 1 I-12 01 Jun 2026 Yes 20 May 2026 19 May 2026 20 May 2026 - - 3 - 2 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 - 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 22 May 2026 Day Not Applicable for Calculation 4 4 6 3 3 3 3 - - - 2 1 1 2 1 1 1 2 - - - 1 5 6 8 - 01 Jun 2026 14:25:57 Clinical Nonresponder No N/A N/A N/A N/A
50 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-0 09 Apr 2026 Yes 08 Apr 2026 31 Mar 2026 01 Apr 2026 - - 2 - 2 08 Apr 2026 Endoscopy 07 Apr 2026 - 06 Apr 2026 - 05 Apr 2026 - 04 Apr 2026 - 03 Apr 2026 - 02 Apr 2026 - 01 Apr 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 31 Mar 2026 Bowel Preparation for Procedure;Day Not Applicable for Calculation 30 Mar 2026 - - 3 3 4 3 4 3 - - 3 1 - 2 2 2 2 2 2 - - 2 2 5 5 7 - 29 May 2026 11:07:08 N/A N/A N/A N/A N/A N/A
51 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-2 22 Apr 2026 - - - - - - - - 2 21 Apr 2026 - 20 Apr 2026 - 19 Apr 2026 - 18 Apr 2026 - 17 Apr 2026 - 16 Apr 2026 - 15 Apr 2026 - 14 Apr 2026 Day Not Applicable for Calculation 13 Apr 2026 Day Not Applicable for Calculation 12 Apr 2026 Day Not Applicable for Calculation 3 3 5 3 2 3 2 - - - 1 1 2 2 1 1 1 2 - - - 1 4 - 05 May 2026 07:29:35 N/A N/A N/A N/A N/A N/A
52 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-4 05 May 2026 - - - - - - - - 2 04 May 2026 - 03 May 2026 - 02 May 2026 - 01 May 2026 - 30 Apr 2026 - 29 Apr 2026 - 28 Apr 2026 - 27 Apr 2026 Day Not Applicable for Calculation 26 Apr 2026 Day Not Applicable for Calculation 25 Apr 2026 Day Not Applicable for Calculation 4 2 2 2 2 2 2 - - - 1 1 1 1 1 2 1 1 - - - 1 4 - 05 May 2026 07:28:55 N/A N/A N/A N/A N/A N/A
53 77242113UCO3001 Adult Czech Republic DD5-CZ10022 Petr Hrabak CZ100222005 1 I-8 02 Jun 2026 - - - - - - - - 2 01 Jun 2026 - 31 May 2026 - 30 May 2026 - 29 May 2026 - 28 May 2026 - 27 May 2026 - 26 May 2026 - 25 May 2026 Day Not Applicable for Calculation 24 May 2026 Day Not Applicable for Calculation 23 May 2026 Day Not Applicable for Calculation 2 2 2 2 2 4 10 - - - 1 2 1 2 1 2 2 2 - - - 2 5 - 02 Jun 2026 08:18:08 N/A N/A N/A N/A N/A N/A
@@ -0,0 +1,83 @@
# jnj_tower_ingest v1.0.0
**File:** `jnj_tower_ingest_v1.0.py`
**Date:** 2026-06-10
**Author:** vladimir.buzalka
**Runs in:** Docker container `python-runner` on Unraid Tower (192.168.1.76), next to MongoDB.
## What it is
A unified **Tower-side ingest** of JNJ emails: it merges two previously separate halves
into a single run:
| Phase | Previously standalone | What it does |
|---|---|---|
| **1. PARSE** | `parse_emails_tower_v1.3.py` | `.msg` files from `/mnt/JNJEMAILS` → rich document in Mongo `emaily."vbuzalka@its.jnj.com"` (body, attachments, headers, MAPI props). `_id` = Internet Message-ID. |
| **2. SYNC** | `sync_jnj_state_v1.0.py` | newest `/mnt/JNJEMAILS/db/jnjemails_*.db` (SQLite, **read-only**, `mode=ro`) → mirrored into `jnj_messages`, plus `jnj_folder`/state filled into `emaily`. |
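
The read-only access named in the SYNC row can be sketched as follows. This is a minimal illustration, not the code of `sync_jnj_state_v1.0.py`: the helper names `newest_state_db` and `open_read_only` are hypothetical, and picking the "newest" file by modification time (rather than by file name) is an assumption.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def newest_state_db(db_dir: Path) -> Optional[Path]:
    """Return the most recently modified jnjemails_*.db in db_dir, or None.

    Assumption: "newest" means highest mtime; the real script may sort by name.
    """
    candidates = list(db_dir.glob("jnjemails_*.db"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open SQLite through a file: URI with mode=ro, so writes are rejected."""
    # as_uri() needs an absolute path and yields "file:///...".
    return sqlite3.connect(f"{db_path.resolve().as_uri()}?mode=ro", uri=True)
```

With `mode=ro`, SQLite refuses any write on that connection, so the ingest cannot modify the source database by accident.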
**Order: parse RUNS BEFORE sync.** Freshly parsed emails therefore get their folder path
within the same run (previously, when sync ran ahead of parse, a new email had nothing
to match, because sync does not create stubs). The join key everywhere = **Internet Message-ID = Mongo `_id`**.
## Incrementality (suitable for cron every 5 min)
- **PARSE**: parses only `.msg` files whose `mtime` is newer than the watermark
  (`jnj_sync_state` / `_id="parse_state"`, field `last_parse_mtime`).
  - **First run = seed:** the watermark is missing → candidates = files whose `filename`
    is not yet in Mongo (a one-off `distinct("filename")`); the watermark is then
    set to the `mtime` of the newest file.
  - **Subsequent runs = incremental:** only `mtime > watermark`. No scan of Mongo.
  - `--full` reparses everything (upsert, idempotent).
  - **Indexes** are created only on `full`/`seed`/`--reindex` (in incremental mode they already exist).
- **SYNC**: watermark `updated_at` (`jnj_sync_state` / `_id="watermark"`) plus the `last_db`
  shortcut (same SQLite file as last time → immediate no-op, Mongo data is not touched).
Two independent events (a new `.msg` / a new `.db`) → the script runs only the phase that has
work to do; otherwise it is a cheap no-op.
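
The seed/incremental selection described above can be sketched roughly like this. It is an illustration under stated assumptions, not the script's actual code: `select_candidates` is a hypothetical name, the non-recursive `*.msg` glob is assumed, and the set of known file names stands in for the result of `distinct("filename")`.

```python
from pathlib import Path
from typing import Iterable, Optional


def select_candidates(
    msg_dir: Path,
    watermark: Optional[float],
    known_filenames: Iterable[str] = (),
) -> tuple[list[Path], Optional[float]]:
    """Return (files to parse, new watermark) for one PARSE run.

    watermark=None models the seed run: take files whose name is not stored yet.
    Otherwise take only files with mtime strictly newer than the watermark.
    """
    files = sorted(msg_dir.glob("*.msg"))
    if watermark is None:
        known = set(known_filenames)
        todo = [p for p in files if p.name not in known]
    else:
        todo = [p for p in files if p.stat().st_mtime > watermark]

    # The watermark never moves backwards, even if the newest file disappeared.
    mtimes = [p.stat().st_mtime for p in files]
    if watermark is not None:
        mtimes.append(watermark)
    return todo, max(mtimes, default=None)
```

A run that finds no file newer than the watermark returns an empty list, which corresponds to the cheap no-op case.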
## Arguments
| Argument | Meaning |
|---|---|
| `--dry-run` | writes nothing, only prints the plan for both phases |
| `--full` | parse: reparse everything; sync: ignore the watermark |
| `--limit N` | at most N files (parse) / rows (sync), for testing |
| `--reindex` | forces index creation after the parse phase |
| `--force` | sync: ignore the `last_db` shortcut |
| `--parse-only` | PARSE phase only |
| `--sync-only` | SYNC phase only |
## Running
```bash
# Test:
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.0.py --dry-run
# Live incremental run (invoked by cron):
docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.0.py
# Full reparse + reindex:
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.0.py --full --reindex
```
## Scheduling (DONE)
Unraid User Scripts job `jnj_state_sync` (cron `*/5 * * * *`): a wrapper with `flock`
calls `docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.0.py`.
It logs only real work/errors to `/mnt/user/Scripts/logs/jnj_tower_ingest.log`
(grep `Zapisuji|PARSE hotovo|SYNC hotovo|CHYBA|Traceback`). The cron line/schedule did not
change when switching over from `sync_jnj_state`; only the wrapper's contents did.
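
The wrapper itself is a shell script and is not shown here. The no-overlap guarantee it gets from `flock` could be sketched in Python with `fcntl.flock`; the function name `single_instance` and the lock path in the usage comment are hypothetical and Unix-only.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Yield True if this run holds the lock, False if another run already has it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # Non-blocking exclusive lock: fail at once instead of queueing runs.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        yield True
    finally:
        os.close(fd)  # closing the descriptor also releases the lock


# Usage (hypothetical lock path):
# with single_instance("/tmp/jnj_tower_ingest.lock") as have_lock:
#     if have_lock:
#         ...  # run the ingest
```

Skipping a run when the lock is taken, rather than waiting for it, keeps a slow run from piling up behind the 5-minute schedule.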
## Revert
The old scripts `parse_emails_tower_v1.3.py` and `sync_jnj_state_v1.0.py` remain in
`/scripts/` as a fallback. To revert, rewrite the wrapper to call `sync_jnj_state_v1.0.py` again.
## Dependencies
`extract-msg==0.55.0`, `olefile`, `pymongo`, `python-dateutil`, `sqlite3` (stdlib).
Python 3.10+.
## Version history
- **1.0.0** 2026-06-10: merged `parse_emails_tower_v1.3` + `sync_jnj_state_v1.0`;
  parse made incremental via an mtime watermark; indexes only on full/seed/`--reindex`;
  order parse→sync.
File diff suppressed because it is too large
@@ -0,0 +1,514 @@
"""
==============================================================================
Script:  1b_parse_emails_graph_delta_v1.0.py
Version: 1.0
Date:    2026-06-04
Author:  vladimir.buzalka

Description:
    Incremental email sync via Microsoft Graph DELTA QUERY.
    Sibling of `1_parse_emails_graph_v1.4.py`; each covers a different use case:
        1_parse_emails_graph_v1.4.py        = initial full import of a mailbox
        1b_parse_emails_graph_delta_v1.0.py = regular sync (changes since last run)

    Delta query is server-side change tracking: Graph remembers a "bookmark"
    (deltaLink) and returns only what has changed since it:
        - new messages
        - changes to existing ones (isRead, flag, move to another folder, categories)
        - DELETED messages (@removed): permanently deleted, not merely in the trash

    For a message in "Deleted Items" delta does nothing special; it is still a
    normal message, just with folder_path="Deleted Items". @removed arrives only
    once the user empties the trash or uses Shift+Del.

State:
    Collection `emaily.sync_state`, _id = "<mailbox>|<folder_id>".
    {
        mailbox, folder_id, folder_path,
        delta_link,           # full URL with $deltatoken for the next run
        last_run_at,
        cumulative_new, cumulative_sync, cumulative_removed
    }

Permanently deleted messages:
    The script does NOT delete them from Mongo. It only sets:
        permanently_deleted: True
        permanently_deleted_at: <UTC datetime of detection>
    Lookup: col.find({"permanently_deleted": True})

Reuse:
    The functions extract_message / extract_sync_fields are loaded directly from
    the module 1_parse_emails_graph_v1.4.py (importlib, file-based) so that the
    extraction logic never diverges.

Usage:
    python 1b_parse_emails_graph_delta_v1.0.py                                  # ALL mailboxes (except SKIP_MAILBOXES)
    python 1b_parse_emails_graph_delta_v1.0.py --mailbox ordinace@buzalkova.cz  # a single mailbox
    python 1b_parse_emails_graph_delta_v1.0.py --mailbox ordinace@buzalkova.cz --folder Inbox
    python 1b_parse_emails_graph_delta_v1.0.py --reset                          # discard deltaLinks and start over
    python 1b_parse_emails_graph_delta_v1.0.py --dry-run                        # save nothing

SKIP_MAILBOXES (hardcoded):
    vbuzalka@its.jnj.com: JNJ tenant, we have no Graph API access. This
    mailbox needs a separate script (local .msg).

Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+
==============================================================================
"""
from __future__ import annotations
import argparse
import importlib.util
import logging
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import msal
import requests
from pymongo import MongoClient, ASCENDING
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
SYNC_STATE_COL = "sync_state"
PAGE_SIZE = 100  # the delta endpoint typically returns at most 100 items per page
LOG_FILE = Path(__file__).parent / "delta_errors.log"
SCRIPT_VERSION = "1.0"
# Collections in `emaily` that are NOT mailboxes:
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
# Mailboxes where we have NO Graph API access; they are skipped in a normal run.
# These need a separate script (e.g. a local .msg parser).
SKIP_MAILBOXES = {
    "vbuzalka@its.jnj.com",  # JNJ tenant: we have no Graph credentials
}
logging.basicConfig(
    filename=str(LOG_FILE),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)
# What to pull from the delta endpoint (same as MSG_SELECT in v1.4, minus
# internetMessageHeaders, which delta cannot return for every item; for new
# messages we fetch them with a separate request).
DELTA_SELECT = (
    "id,internetMessageId,subject,bodyPreview,body,"
    "importance,isRead,isDraft,hasAttachments,"
    "receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
    "sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
    "conversationId,conversationIndex,parentFolderId,"
    "categories,flag,inferenceClassification"
)
# To load a new message in full (including headers and attachments) we use the
# same select+expand as v1.4
FULL_FETCH_SELECT = (
    "id,internetMessageId,subject,bodyPreview,body,"
    "importance,isRead,isDraft,hasAttachments,"
    "receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
    "sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
    "conversationId,conversationIndex,parentFolderId,"
    "categories,flag,inferenceClassification,internetMessageHeaders"
)
FULL_FETCH_EXPAND = "attachments($select=id,name,contentType,size,isInline)"
# ─── Reuse of the extract logic from v1.4 ─────────────────────────────────────
_HERE = Path(__file__).parent
_V14_PATH = _HERE / "1_parse_emails_graph_v1.4.py"
if not _V14_PATH.exists():
    print(f"CHYBA: chybi sourozenec {_V14_PATH.name} — extract logiku nelze nacist", file=sys.stderr)
    sys.exit(1)
# Load the sibling by file path: its name starts with a digit, so a plain
# `import` statement cannot be used.
_spec = importlib.util.spec_from_file_location("v14_parse", _V14_PATH)
_v14 = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_v14)
extract_message = _v14.extract_message
extract_sync_fields = _v14.extract_sync_fields
# GRAPH_MAILBOX is module-level in v1.4; extract does not need it, but for
# consistency we set it in main()
# ─── Graph API ────────────────────────────────────────────────────────────────
_graph_token: Optional[str] = None


def get_token() -> str:
    global _graph_token
    app = msal.ConfidentialClientApplication(
        GRAPH_CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
        client_credential=GRAPH_CLIENT_SECRET,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(f"Graph auth failed: {result}")
    _graph_token = result["access_token"]
    return _graph_token


class DeltaExpired(Exception):
    """The deltaLink has expired (HTTP 410); a full delta must be started again."""


def graph_get(url: str, params: Optional[dict] = None, allow_410: bool = False) -> dict:
    """GET against Graph, retrying on 401 and 429. On 410 with allow_410=True raises DeltaExpired."""
    global _graph_token
    if not _graph_token:
        get_token()
    for attempt in range(3):
        r = requests.get(
            url,
            headers={"Authorization": f"Bearer {_graph_token}"},
            params=params,
            timeout=60,
        )
        if r.status_code == 401:
            # token expired or rejected: refresh it and retry
            get_token()
            continue
        if r.status_code == 410 and allow_410:
            raise DeltaExpired(url)
        if r.status_code == 429:
            # rate limit: respect Retry-After
            wait = int(r.headers.get("Retry-After", "5"))
            print(f" [429] cekam {wait}s ...")
            time.sleep(wait)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retries: {url}")
def get_all_folders(mailbox: str, parent_id: str = None, parent_path: str = "") -> list[dict]:
if parent_id is None:
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders"
else:
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{parent_id}/childFolders"
folders = []
params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
while url:
data = graph_get(url, params)
for f in data.get("value", []):
path = f"{parent_path}/{f['displayName']}".lstrip("/")
folders.append({"id": f["id"], "path": path})
if f.get("childFolderCount", 0) > 0:
folders.extend(get_all_folders(mailbox, f["id"], path))
url = data.get("@odata.nextLink")
params = None
return folders
def fetch_full_message(mailbox: str, msg_id: str) -> Optional[dict]:
    """Download the whole message including headers and attachments; used for new messages found in the delta."""
url = f"{GRAPH_URL}/users/{mailbox}/messages/{msg_id}"
params = {"$select": FULL_FETCH_SELECT, "$expand": FULL_FETCH_EXPAND}
try:
return graph_get(url, params)
except requests.HTTPError as e:
logging.error("fetch_full_message %s: %s", msg_id, e)
return None
# ─── Delta iteration ──────────────────────────────────────────────────────────
def iter_folder_delta(mailbox: str, folder_id: str, delta_link: Optional[str], limit: int = 0):
    """
    Generator yielding (item, final_delta_link) tuples.
    item is a dict with one delta entry (a change, or an entry carrying '@removed').
    Every tuple is (item, None) except the last one, which is (None, deltaLink).
    On HTTP 410 (expired deltaLink) DeltaExpired is raised; the caller should
    run again with delta_link=None (a fresh full delta).
    """
if delta_link:
url = delta_link
params = None
else:
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{folder_id}/messages/delta"
params = {"$select": DELTA_SELECT, "$top": PAGE_SIZE}
n = 0
while url:
data = graph_get(url, params, allow_410=True)
params = None
for item in data.get("value", []):
yield item, None
n += 1
if limit and n >= limit:
# --limit is a test aid: stop without yielding a deltaLink, so no state is
# saved and the next run simply starts over
return
next_link = data.get("@odata.nextLink")
final_link = data.get("@odata.deltaLink")
if final_link:
# end of the round: hand over the final deltaLink
yield None, final_link
return
url = next_link
# ─── Per-folder sync ──────────────────────────────────────────────────────────
def sync_folder(col, sync_col, mailbox: str, folder: dict, dry_run: bool, limit: int) -> dict:
    """Return the statistics."""
fid = folder["id"]
fpath = folder["path"]
state_id = f"{mailbox}|{fid}"
state = sync_col.find_one({"_id": state_id})
delta_link = state.get("delta_link") if state else None
is_first_run = delta_link is None
label = "FRESH" if is_first_run else "DELTA"
print(f"\n[{label}] {fpath}")
stats = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
final_delta = None
try:
gen = iter_folder_delta(mailbox, fid, delta_link, limit=limit)
for item, fin in gen:
if fin:
final_delta = fin
break
try:
process_item(col, mailbox, fpath, item, stats, dry_run)
except Exception as e:
stats["errors"] += 1
logging.error("process_item %s: %s", item.get("id", "?"), e)
except DeltaExpired:
print(f" [410] deltaLink expiroval — restart od fresh delta")
# recursive restart with the stored state deleted
sync_col.delete_one({"_id": state_id})
return sync_folder(col, sync_col, mailbox, folder, dry_run, limit)
print(f" new={stats['new']} sync={stats['sync']} removed={stats['removed']} err={stats['errors']}")
# Save sync_state if we have final_delta and this is not a dry run
if final_delta and not dry_run:
sync_col.update_one(
{"_id": state_id},
{
"$set": {
"mailbox": mailbox,
"folder_id": fid,
"folder_path": fpath,
"delta_link": final_delta,
"last_run_at": datetime.now(timezone.utc).replace(tzinfo=None),
},
"$inc": {
"cumulative_new": stats["new"],
"cumulative_sync": stats["sync"],
"cumulative_removed": stats["removed"],
"run_count": 1,
},
},
upsert=True,
)
elif not final_delta:
# no deltaLink arrived (e.g. because of --limit or an error): leave the state
# untouched; the next run continues from the stored deltaLink or starts fresh
if not is_first_run:
print(f" [pozn] delta neukoncena — pristi beh pojede od ulozeneho deltaLinku")
return stats
def process_item(col, mailbox: str, folder_path: str, item: dict, stats: dict, dry_run: bool):
    """Process one item from the delta response."""
    # 1) Deleted message (@removed)
if "@removed" in item or item.get("@removed.reason"):
graph_id = item.get("id")
if not graph_id:
return
if dry_run:
print(f" REMOVED graph_id={graph_id[:30]}...")
else:
col.update_one(
{"graph_id": graph_id},
{"$set": {
"permanently_deleted": True,
"permanently_deleted_at": datetime.now(timezone.utc).replace(tzinfo=None),
}},
)
stats["removed"] += 1
return
# 2) New or changed message: decide by whether graph_id already exists in Mongo
graph_id = item.get("id")
if not graph_id:
return
existing = col.find_one({"graph_id": graph_id}, {"_id": 1})
if existing:
# Existing message: update only the sync fields (the delta payload contains them)
fields = extract_sync_fields(item, folder_path)
if dry_run:
print(f" SYNC {item.get('subject','')[:60]}")
else:
col.update_one({"_id": existing["_id"]}, {"$set": fields})
stats["sync"] += 1
else:
# New message: fetch the full version for body, attachments and headers
full = fetch_full_message(mailbox, graph_id)
if full is None:
stats["errors"] += 1
return
doc = extract_message(full, folder_path)
if doc is None:
stats["errors"] += 1
return
if dry_run:
print(f" NEW {doc.get('subject','')[:60]}")
else:
col.update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
stats["new"] += 1
# ─── Indexes for sync_state ───────────────────────────────────────────────────
def ensure_sync_state_indexes(sync_col):
sync_col.create_index([("mailbox", ASCENDING), ("folder_id", ASCENDING)])
sync_col.create_index([("last_run_at", ASCENDING)])
def ensure_perm_deleted_index(col):
col.create_index([("permanently_deleted", ASCENDING)], sparse=True)
# ─── Main ─────────────────────────────────────────────────────────────────────
def discover_mailboxes(db) -> list[str]:
    """Return the list of mailboxes: all collections in `emaily` except
    NON_MAILBOX_COLLECTIONS and SKIP_MAILBOXES."""
out = []
for name in sorted(db.list_collection_names()):
if name in NON_MAILBOX_COLLECTIONS:
continue
if name in SKIP_MAILBOXES:
print(f" [skip] {name} — v SKIP_MAILBOXES (neni Graph pristup)")
continue
out.append(name)
return out
def sync_mailbox(client, mailbox: str, args) -> dict:
    """Sync a single mailbox. Returns the totals dict."""
_v14.GRAPH_MAILBOX = mailbox
print(f"\n========== {mailbox} ==========")
col = client[MONGO_DB][mailbox]
sync_col = client[MONGO_DB][SYNC_STATE_COL]
if not args.dry_run:
ensure_sync_state_indexes(sync_col)
ensure_perm_deleted_index(col)
if args.reset:
n = sync_col.delete_many({"mailbox": mailbox}).deleted_count
print(f" --reset: smazano {n} deltaLinku pro {mailbox}")
print("Nacitam seznam slozek...")
try:
folders = get_all_folders(mailbox)
except requests.HTTPError as e:
print(f" CHYBA: nelze nacist slozky pro {mailbox}: {e}")
logging.error("get_all_folders %s: %s", mailbox, e)
return {"new": 0, "sync": 0, "removed": 0, "errors": 1}
if args.folder:
folders = [f for f in folders if args.folder.lower() in f["path"].lower()]
print(f" Slozek ke zpracovani: {len(folders)}")
totals = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
for folder in folders:
s = sync_folder(col, sync_col, mailbox, folder, args.dry_run, args.limit)
for k in totals:
totals[k] += s[k]
print(f" -> mailbox total: new={totals['new']} sync={totals['sync']} removed={totals['removed']} err={totals['errors']}")
return totals
def main():
ap = argparse.ArgumentParser(description=f"parse_emails_graph delta sync v{SCRIPT_VERSION}")
ap.add_argument("--mailbox", default="",
help="E-mail schranky (= kolekce v Mongo). "
"Bez argumentu projede vsechny schranky z `emaily` (mimo SKIP_MAILBOXES).")
ap.add_argument("--folder", default="", help="Filtruje slozky obsahujici tento retezec (default: vsechny)")
ap.add_argument("--limit", type=int, default=0, help="Max polozek na slozku (test)")
ap.add_argument("--reset", action="store_true",
help="Smaze deltaLinky pro vybrane schranky — pristi beh zacne od fresh delta")
ap.add_argument("--dry-run", action="store_true", help="Nic neulozi do Mongo, jen vypise co by se stalo")
args = ap.parse_args()
print(f"=== Delta sync v{SCRIPT_VERSION} ===")
if args.dry_run:
print(" DRY-RUN — zadne zmeny v Mongo")
print("Pripojuji se k MongoDB...")
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
client.admin.command("ping")
db = client[MONGO_DB]
if args.mailbox:
if args.mailbox in SKIP_MAILBOXES:
print(f" CHYBA: {args.mailbox} je v SKIP_MAILBOXES — neni Graph pristup.")
sys.exit(2)
mailboxes = [args.mailbox]
else:
mailboxes = discover_mailboxes(db)
print(f" Schranky ke zpracovani: {len(mailboxes)}")
for m in mailboxes:
print(f" {m}")
print("Token Graph API...")
get_token()
print(" OK")
t0 = time.time()
grand = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
per_mailbox = []
for mb in mailboxes:
try:
s = sync_mailbox(client, mb, args)
except Exception as e:
print(f" FATAL pri sync {mb}: {e}")
logging.error("sync_mailbox %s: %s", mb, e)
s = {"new": 0, "sync": 0, "removed": 0, "errors": 1}
per_mailbox.append((mb, s))
for k in grand:
grand[k] += s[k]
dt = time.time() - t0
print(f"\n=== SHRNUTI ===")
for mb, s in per_mailbox:
print(f" {mb:40} new={s['new']:>5} sync={s['sync']:>5} removed={s['removed']:>4} err={s['errors']:>3}")
print(f" {'TOTAL':40} new={grand['new']:>5} sync={grand['sync']:>5} removed={grand['removed']:>4} err={grand['errors']:>3}")
print(f" trvalo: {dt:.1f} s")
return 1 if grand["errors"] > 0 else 0
if __name__ == "__main__":
sys.exit(main() or 0)
@@ -0,0 +1,523 @@
"""
==============================================================================
Script:   1b_parse_emails_graph_delta_v1.1.py
Version:  1.1
Date:     2026-06-10
Author:   vladimir.buzalka

Changes in v1.1 (2026-06-10):
  - Bugfix: NON_MAILBOX_COLLECTIONS extended with "jnj_messages" and
    "jnj_sync_state" (helper collections for JNJ folder tracking). Previously
    discover_mailboxes treated them as mailboxes -> Graph 404 on
    /users/jnj_messages/mailFolders -> the whole step 1b ended with FAIL(1)
    on every run.

Description:
  Incremental email sync via a Microsoft Graph DELTA QUERY.
  Sibling of `1_parse_emails_graph_v1.4.py`; each covers a different use case:
    1_parse_emails_graph_v1.4.py        = first full import of a mailbox
    1b_parse_emails_graph_delta_v1.1.py = regular sync (changes since last run)

  A delta query is server-side change tracking: Graph remembers a "bookmark"
  (deltaLink) and returns only what has changed since it:
    - new messages
    - changes to existing ones (isRead, flag, move to another folder, categories)
    - DELETED messages (@removed): permanently deleted, not merely in the trash

  For mail in "Deleted Items" the delta does nothing special; it is still a
  normal message, just with folder_path="Deleted Items". @removed only arrives
  once the user empties the trash or uses Shift+Del.

State:
  Collection `emaily.sync_state`, _id = "<mailbox>|<folder_id>".
    {
      mailbox, folder_id, folder_path,
      delta_link,        # full URL with $deltatoken for the next run
      last_run_at,
      cumulative_new, cumulative_sync, cumulative_removed,
      run_count
    }

Permanently deleted messages:
  The script does NOT delete them from Mongo. It only sets:
    permanently_deleted: True
    permanently_deleted_at: <UTC datetime of detection>
  To find them: col.find({"permanently_deleted": True})

Reuse:
  The functions extract_message / extract_sync_fields are loaded directly from
  the module 1_parse_emails_graph_v1.4.py (importlib, file-based) so that the
  extraction logic never diverges.

Usage:
  python 1b_parse_emails_graph_delta_v1.1.py                                  # ALL mailboxes (except SKIP_MAILBOXES)
  python 1b_parse_emails_graph_delta_v1.1.py --mailbox ordinace@buzalkova.cz  # a single mailbox
  python 1b_parse_emails_graph_delta_v1.1.py --mailbox ordinace@buzalkova.cz --folder Inbox
  python 1b_parse_emails_graph_delta_v1.1.py --reset                          # discard the deltaLinks and start over
  python 1b_parse_emails_graph_delta_v1.1.py --dry-run                        # save nothing

SKIP_MAILBOXES (hardcoded):
  vbuzalka@its.jnj.com: JNJ tenant, we have no Graph API access. This mailbox
  needs a separate script (local .msg files).

Dependencies:
  msal, requests, pymongo, python-dateutil
  Python 3.10+
==============================================================================
"""
from __future__ import annotations
import argparse
import importlib.util
import logging
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional
import msal
import requests
from pymongo import MongoClient, ASCENDING
if hasattr(sys.stdout, "reconfigure"):
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_URL = "https://graph.microsoft.com/v1.0"
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
SYNC_STATE_COL = "sync_state"
PAGE_SIZE = 100  # the delta endpoint typically returns at most 100 items per page
LOG_FILE = Path(__file__).parent / "delta_errors.log"
SCRIPT_VERSION = "1.1"
# Collections in `emaily` that are NOT mailboxes:
# (jnj_messages + jnj_sync_state = helper collections for JNJ folder tracking;
# without the exclusion discover_mailboxes treats them as mailboxes -> Graph 404 -> FAIL)
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state",
"jnj_messages", "jnj_sync_state"}
# Mailboxes for which we have NO Graph API access; a normal run skips them.
# These need a separate script (e.g. a local .msg parser).
SKIP_MAILBOXES = {
    "vbuzalka@its.jnj.com",  # JNJ tenant: we have no Graph credentials
}
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
# Fields to pull from the delta endpoint (same as MSG_SELECT in v1.4, minus
# internetMessageHeaders, which delta cannot return for every item; for new
# messages we fetch them with a separate request).
DELTA_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification"
)
# To fully load a new message (including headers and attachments) we use the
# same select+expand as v1.4
FULL_FETCH_SELECT = (
"id,internetMessageId,subject,bodyPreview,body,"
"importance,isRead,isDraft,hasAttachments,"
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
"conversationId,conversationIndex,parentFolderId,"
"categories,flag,inferenceClassification,internetMessageHeaders"
)
FULL_FETCH_EXPAND = "attachments($select=id,name,contentType,size,isInline)"
# ─── Reuse of the extract logic from v1.4 ─────────────────────────────────────
_HERE = Path(__file__).parent
_V14_PATH = _HERE / "1_parse_emails_graph_v1.4.py"
if not _V14_PATH.exists():
print(f"CHYBA: chybi sourozenec {_V14_PATH.name} — extract logiku nelze nacist", file=sys.stderr)
sys.exit(1)
_spec = importlib.util.spec_from_file_location("v14_parse", _V14_PATH)
_v14 = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_v14)
extract_message = _v14.extract_message
extract_sync_fields = _v14.extract_sync_fields
# GRAPH_MAILBOX is module-level in v1.4. The extract functions do not need it,
# but for consistency we set it in sync_mailbox()
# ─── Graph API ────────────────────────────────────────────────────────────────
_graph_token: Optional[str] = None
def get_token() -> str:
global _graph_token
app = msal.ConfidentialClientApplication(
GRAPH_CLIENT_ID,
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
client_credential=GRAPH_CLIENT_SECRET,
)
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
if "access_token" not in result:
raise RuntimeError(f"Graph auth failed: {result}")
_graph_token = result["access_token"]
return _graph_token
class DeltaExpired(Exception):
    """The deltaLink has expired (HTTP 410); the sync must restart from a full delta."""


def graph_get(url: str, params: Optional[dict] = None, allow_410: bool = False) -> dict:
    """GET against Graph, retrying on 401 (token refresh) and 429 (rate limit).

    Raises DeltaExpired on HTTP 410 when allow_410=True.
    """
    global _graph_token
    if not _graph_token:
        get_token()
    for _attempt in range(3):
        r = requests.get(
            url,
            headers={"Authorization": f"Bearer {_graph_token}"},
            params=params,
            timeout=60,
        )
        if r.status_code == 401:
            # Token expired or rejected: fetch a new one and retry.
            get_token()
            continue
        if r.status_code == 410 and allow_410:
            raise DeltaExpired(url)
        if r.status_code == 429:
            # Rate limited: honor Retry-After, falling back to 5 s when the
            # header is missing or is not a plain number of seconds.
            try:
                wait = int(r.headers.get("Retry-After", "5"))
            except ValueError:
                wait = 5
            print(f" [429] cekam {wait}s ...")
            time.sleep(wait)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retries: {url}")
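# --- Illustrative sketch (not part of the original script) --------------------
# graph_get() reads Retry-After as a number of seconds. HTTP also allows an
# HTTP-date in that header, so a more tolerant parser might look like the
# helper below. The name parse_retry_after and the 5-second default are
# assumptions made for this example only; nothing in the script calls it.
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], default: float = 5.0) -> float:
    """Return a non-negative wait in seconds for a Retry-After header value."""
    if not value:
        return default
    value = value.strip()
    if value.isascii() and value.isdigit():
        return float(value)                    # delta-seconds form, e.g. "30"
    try:
        when = parsedate_to_datetime(value)    # HTTP-date form
    except (TypeError, ValueError):
        return default                         # unparseable: use the default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    # A date in the past means "retry now", hence the clamp at zero.
    return max(0.0, (when - datetime.now(timezone.utc)).total_seconds())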
def get_all_folders(mailbox: str, parent_id: str = None, parent_path: str = "") -> list[dict]:
if parent_id is None:
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders"
else:
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{parent_id}/childFolders"
folders = []
params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
while url:
data = graph_get(url, params)
for f in data.get("value", []):
path = f"{parent_path}/{f['displayName']}".lstrip("/")
folders.append({"id": f["id"], "path": path})
if f.get("childFolderCount", 0) > 0:
folders.extend(get_all_folders(mailbox, f["id"], path))
url = data.get("@odata.nextLink")
params = None
return folders
def fetch_full_message(mailbox: str, msg_id: str) -> Optional[dict]:
    """Download the whole message including headers and attachments; used for new messages found in the delta."""
url = f"{GRAPH_URL}/users/{mailbox}/messages/{msg_id}"
params = {"$select": FULL_FETCH_SELECT, "$expand": FULL_FETCH_EXPAND}
try:
return graph_get(url, params)
except requests.HTTPError as e:
logging.error("fetch_full_message %s: %s", msg_id, e)
return None
# ─── Delta iteration ──────────────────────────────────────────────────────────
def iter_folder_delta(mailbox: str, folder_id: str, delta_link: Optional[str], limit: int = 0):
    """
    Generator yielding (item, final_delta_link) tuples.

    item is a dict with one delta entry (a change, or an entry carrying '@removed').
    Every tuple is (item, None) except the last one, which is (None, deltaLink).
    On HTTP 410 (expired deltaLink) DeltaExpired is raised; the caller should
    run again with delta_link=None (a fresh full delta).
    """
    if delta_link:
        url = delta_link
        params = None
    else:
        url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{folder_id}/messages/delta"
        params = {"$select": DELTA_SELECT, "$top": PAGE_SIZE}
    n = 0
    while url:
        data = graph_get(url, params, allow_410=True)
        params = None  # nextLink already carries the query string
        for item in data.get("value", []):
            yield item, None
            n += 1
            if limit and n >= limit:
                # --limit is a test aid: stop without yielding a deltaLink, so
                # no state is saved and the next run simply starts over.
                return
        next_link = data.get("@odata.nextLink")
        final_link = data.get("@odata.deltaLink")
        if final_link:
            # End of the round: hand over the final deltaLink.
            yield None, final_link
            return
        url = next_link
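# --- Illustrative sketch (not part of the original script) --------------------
# iter_folder_delta() yields (item, None) for every change and finishes with a
# single (None, deltaLink) tuple. The helper below shows one way a caller might
# drain any generator that follows this protocol. drain_delta is a hypothetical
# name used only for this example; the script itself consumes the generator
# inside sync_folder().
from typing import Iterable, Optional, Tuple


def drain_delta(pairs: Iterable[Tuple[Optional[dict], Optional[str]]]) -> Tuple[list, Optional[str]]:
    """Collect the items and the final deltaLink from an (item, link) stream."""
    items = []
    final_link = None
    for item, link in pairs:
        if link is not None:
            final_link = link      # terminal tuple: the round is complete
            break
        items.append(item)
    # final_link stays None when the stream ended early (e.g. with --limit),
    # which tells the caller not to store a new bookmark.
    return items, final_link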
# ─── Per-folder sync ──────────────────────────────────────────────────────────
def sync_folder(col, sync_col, mailbox: str, folder: dict, dry_run: bool, limit: int) -> dict:
    """Return the statistics."""
fid = folder["id"]
fpath = folder["path"]
state_id = f"{mailbox}|{fid}"
state = sync_col.find_one({"_id": state_id})
delta_link = state.get("delta_link") if state else None
is_first_run = delta_link is None
label = "FRESH" if is_first_run else "DELTA"
print(f"\n[{label}] {fpath}")
stats = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
final_delta = None
try:
gen = iter_folder_delta(mailbox, fid, delta_link, limit=limit)
for item, fin in gen:
if fin:
final_delta = fin
break
try:
process_item(col, mailbox, fpath, item, stats, dry_run)
except Exception as e:
stats["errors"] += 1
logging.error("process_item %s: %s", item.get("id", "?"), e)
except DeltaExpired:
print(f" [410] deltaLink expiroval — restart od fresh delta")
# recursive restart with the stored state deleted
sync_col.delete_one({"_id": state_id})
return sync_folder(col, sync_col, mailbox, folder, dry_run, limit)
print(f" new={stats['new']} sync={stats['sync']} removed={stats['removed']} err={stats['errors']}")
# Save sync_state if we have final_delta and this is not a dry run
if final_delta and not dry_run:
sync_col.update_one(
{"_id": state_id},
{
"$set": {
"mailbox": mailbox,
"folder_id": fid,
"folder_path": fpath,
"delta_link": final_delta,
"last_run_at": datetime.now(timezone.utc).replace(tzinfo=None),
},
"$inc": {
"cumulative_new": stats["new"],
"cumulative_sync": stats["sync"],
"cumulative_removed": stats["removed"],
"run_count": 1,
},
},
upsert=True,
)
elif not final_delta:
# no deltaLink arrived (e.g. because of --limit or an error): leave the state
# untouched; the next run continues from the stored deltaLink or starts fresh
if not is_first_run:
print(f" [pozn] delta neukoncena — pristi beh pojede od ulozeneho deltaLinku")
return stats
def process_item(col, mailbox: str, folder_path: str, item: dict, stats: dict, dry_run: bool):
    """Process one item from the delta response."""
    # 1) Deleted message (@removed)
if "@removed" in item or item.get("@removed.reason"):
graph_id = item.get("id")
if not graph_id:
return
if dry_run:
print(f" REMOVED graph_id={graph_id[:30]}...")
else:
col.update_one(
{"graph_id": graph_id},
{"$set": {
"permanently_deleted": True,
"permanently_deleted_at": datetime.now(timezone.utc).replace(tzinfo=None),
}},
)
stats["removed"] += 1
return
# 2) New or changed message: decide by whether graph_id already exists in Mongo
graph_id = item.get("id")
if not graph_id:
return
existing = col.find_one({"graph_id": graph_id}, {"_id": 1})
if existing:
# Existing message: update only the sync fields (the delta payload contains them)
fields = extract_sync_fields(item, folder_path)
if dry_run:
print(f" SYNC {item.get('subject','')[:60]}")
else:
col.update_one({"_id": existing["_id"]}, {"$set": fields})
stats["sync"] += 1
else:
# New message: fetch the full version for body, attachments and headers
full = fetch_full_message(mailbox, graph_id)
if full is None:
stats["errors"] += 1
return
doc = extract_message(full, folder_path)
if doc is None:
stats["errors"] += 1
return
if dry_run:
print(f" NEW {doc.get('subject','')[:60]}")
else:
col.update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
stats["new"] += 1
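# --- Illustrative sketch (not part of the original script) --------------------
# process_item() separates three outcomes: a removed message, a change to a
# message already stored, and a brand-new message. The pure function below
# restates that decision without touching Mongo or Graph. classify_delta_item
# and known_ids are hypothetical names; known_ids stands in for the lookup
# col.find_one({"graph_id": ...}), and only the "@removed" key is checked,
# which is a simplification of the original condition.
def classify_delta_item(item: dict, known_ids: set) -> str:
    """Return 'removed', 'sync', 'new' or 'skip' for one delta item."""
    graph_id = item.get("id")
    if not graph_id:
        return "skip"              # nothing to match against, as in process_item()
    if "@removed" in item:
        return "removed"           # flagged permanently_deleted, never deleted
    return "sync" if graph_id in known_ids else "new"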
# ─── Indexes for sync_state ───────────────────────────────────────────────────
def ensure_sync_state_indexes(sync_col):
sync_col.create_index([("mailbox", ASCENDING), ("folder_id", ASCENDING)])
sync_col.create_index([("last_run_at", ASCENDING)])
def ensure_perm_deleted_index(col):
col.create_index([("permanently_deleted", ASCENDING)], sparse=True)
# ─── Main ─────────────────────────────────────────────────────────────────────
def discover_mailboxes(db) -> list[str]:
    """Return the list of mailboxes: all collections in `emaily` except
    NON_MAILBOX_COLLECTIONS and SKIP_MAILBOXES."""
out = []
for name in sorted(db.list_collection_names()):
if name in NON_MAILBOX_COLLECTIONS:
continue
if name in SKIP_MAILBOXES:
print(f" [skip] {name} — v SKIP_MAILBOXES (neni Graph pristup)")
continue
out.append(name)
return out
def sync_mailbox(client, mailbox: str, args) -> dict:
    """Sync a single mailbox. Returns the totals dict."""
_v14.GRAPH_MAILBOX = mailbox
print(f"\n========== {mailbox} ==========")
col = client[MONGO_DB][mailbox]
sync_col = client[MONGO_DB][SYNC_STATE_COL]
if not args.dry_run:
ensure_sync_state_indexes(sync_col)
ensure_perm_deleted_index(col)
if args.reset:
n = sync_col.delete_many({"mailbox": mailbox}).deleted_count
print(f" --reset: smazano {n} deltaLinku pro {mailbox}")
print("Nacitam seznam slozek...")
try:
folders = get_all_folders(mailbox)
except requests.HTTPError as e:
print(f" CHYBA: nelze nacist slozky pro {mailbox}: {e}")
logging.error("get_all_folders %s: %s", mailbox, e)
return {"new": 0, "sync": 0, "removed": 0, "errors": 1}
if args.folder:
folders = [f for f in folders if args.folder.lower() in f["path"].lower()]
print(f" Slozek ke zpracovani: {len(folders)}")
totals = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
for folder in folders:
s = sync_folder(col, sync_col, mailbox, folder, args.dry_run, args.limit)
for k in totals:
totals[k] += s[k]
print(f" -> mailbox total: new={totals['new']} sync={totals['sync']} removed={totals['removed']} err={totals['errors']}")
return totals
def main():
ap = argparse.ArgumentParser(description=f"parse_emails_graph delta sync v{SCRIPT_VERSION}")
ap.add_argument("--mailbox", default="",
help="E-mail schranky (= kolekce v Mongo). "
"Bez argumentu projede vsechny schranky z `emaily` (mimo SKIP_MAILBOXES).")
ap.add_argument("--folder", default="", help="Filtruje slozky obsahujici tento retezec (default: vsechny)")
ap.add_argument("--limit", type=int, default=0, help="Max polozek na slozku (test)")
ap.add_argument("--reset", action="store_true",
help="Smaze deltaLinky pro vybrane schranky — pristi beh zacne od fresh delta")
ap.add_argument("--dry-run", action="store_true", help="Nic neulozi do Mongo, jen vypise co by se stalo")
args = ap.parse_args()
print(f"=== Delta sync v{SCRIPT_VERSION} ===")
if args.dry_run:
print(" DRY-RUN — zadne zmeny v Mongo")
print("Pripojuji se k MongoDB...")
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
client.admin.command("ping")
db = client[MONGO_DB]
if args.mailbox:
if args.mailbox in SKIP_MAILBOXES:
print(f" CHYBA: {args.mailbox} je v SKIP_MAILBOXES — neni Graph pristup.")
sys.exit(2)
mailboxes = [args.mailbox]
else:
mailboxes = discover_mailboxes(db)
print(f" Schranky ke zpracovani: {len(mailboxes)}")
for m in mailboxes:
print(f" {m}")
print("Token Graph API...")
get_token()
print(" OK")
t0 = time.time()
grand = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
per_mailbox = []
for mb in mailboxes:
try:
s = sync_mailbox(client, mb, args)
except Exception as e:
print(f" FATAL pri sync {mb}: {e}")
logging.error("sync_mailbox %s: %s", mb, e)
s = {"new": 0, "sync": 0, "removed": 0, "errors": 1}
per_mailbox.append((mb, s))
for k in grand:
grand[k] += s[k]
dt = time.time() - t0
print(f"\n=== SHRNUTI ===")
for mb, s in per_mailbox:
print(f" {mb:40} new={s['new']:>5} sync={s['sync']:>5} removed={s['removed']:>4} err={s['errors']:>3}")
print(f" {'TOTAL':40} new={grand['new']:>5} sync={grand['sync']:>5} removed={grand['removed']:>4} err={grand['errors']:>3}")
print(f" trvalo: {dt:.1f} s")
return 1 if grand["errors"] > 0 else 0
if __name__ == "__main__":
sys.exit(main() or 0)
@@ -0,0 +1,579 @@
"""
==============================================================================
Script:   enrich_fulltext_emails_v1.3.py
Version:  1.3
Date:     2026-06-04
Author:   vladimir.buzalka

Description:
  Extracts the full text of emails stored in MongoDB (db: emaily) and stores it
  in PostgreSQL (db: MongoEmaily, table: emails) with a GIN tsvector index.
  Emails are NOT downloaded again; the bodies are already in Mongo from
  parse_emails_graph_v1.4 (and refetch_text_bodies_v1.0 for old plain-text emails).
  This script only picks the first available body and sends the text to PG
  for full-text search.

Changes in v1.3.1 (2026-06-09):
  - Bugfix: _clean_for_pg replaces lone surrogates (\\ud800-\\udfff) with U+FFFD.
    Previously a single mail with surrogates (e.g. a JNJ .msg) brought down the
    whole batch and step 5 ended with FAIL. EXTRACTOR_VERSION stays at 1.2
    (the fallback logic did not change).

Changes in v1.3 vs v1.2:
  - Bugfix: NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
    (sync_state was added by the delta sync; v1.2 treated it as a mailbox).
  - --index-reset: before processing a mailbox, deletes all of its emails from PG
    (forces re-extraction; use it after bumping EXTRACTOR_VERSION or when you
    want a clean start).
  - Improved per-mailbox header: shows the count in Mongo, in PG, and to be processed.

Changes in v1.2 vs v1.1:
  - S/MIME emails: if unwrap_smime_v1.0 stored smime_body_text/smime_body_html,
    it is PREFERRED over the meaningless wrapper body.
  - body_source: new value "smime".
  - EXTRACTOR_VERSION=1.2 -> all existing emails in PG are re-parsed.

Changes in v1.1 vs v1.0:
  - Fallback order extended with body_text.
  - body_source supports the new value "text" (full plain-text body, max 2 MB).

Source:
  MongoDB 192.168.1.76 db=emaily collection=<mailbox>
  (except NON_MAILBOX_COLLECTIONS)

Target:
  PostgreSQL 192.168.1.76 db=MongoEmaily table=emails
  tsvector config 'soubory' (shared: simple + unaccent)

Incremental behavior:
  If (mailbox, message_id) already exists, extractor_version is current,
  and modified_at in Mongo is not newer -> skip. When the extractor version
  changes, everything is re-parsed. --index-reset bypasses this and clears PG
  before the run.

Usage:
  python enrich_fulltext_emails_v1.3.py                            # all mailboxes
  python enrich_fulltext_emails_v1.3.py --mailbox ordinace@buzalkova.cz
  python enrich_fulltext_emails_v1.3.py --limit 500                # test
  python enrich_fulltext_emails_v1.3.py --mailbox X --index-reset  # clears the mailbox in PG and re-extracts everything
  python enrich_fulltext_emails_v1.3.py --index-reset              # clears the WHOLE index and rebuilds it (SLOW!)
==============================================================================
"""
from __future__ import annotations
import argparse
import re
import sys
import time
import traceback
from datetime import datetime, timezone
from typing import Optional
import psycopg
from bs4 import BeautifulSoup
from pymongo import MongoClient
# --- configuration ----------------------------------------------------------
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
PG_DSN = ("host=192.168.1.76 port=5432 dbname=MongoEmaily "
"user=vladimir.buzalka password=Vlado7309208104++")
EXTRACTOR_VERSION = "1.2"  # DO NOT change unless you change the fallback logic!
MAX_TEXT_BYTES = 5 * 1024 * 1024 # plain text max 5 MB
# Collections in `emaily` that are NOT mailboxes (not processed)
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
BATCH_SIZE = 100
# --- SCHEMA -----------------------------------------------------------------
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_ts_config WHERE cfgname = 'soubory') THEN
CREATE TEXT SEARCH CONFIGURATION soubory ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION soubory
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, simple;
END IF;
END$$;
CREATE TABLE IF NOT EXISTS emails (
id BIGSERIAL PRIMARY KEY,
mailbox TEXT NOT NULL,
message_id TEXT NOT NULL,
graph_id TEXT,
conversation_id TEXT,
folder_path TEXT,
subject TEXT,
sender_email TEXT,
sender_name TEXT,
to_addrs TEXT,
cc_addrs TEXT,
bcc_addrs TEXT,
sent_at TIMESTAMPTZ,
received_at TIMESTAMPTZ,
modified_at TIMESTAMPTZ,
is_read BOOLEAN,
is_draft BOOLEAN,
has_attachments BOOLEAN,
attachment_count INT,
attachments_summary TEXT,
body TEXT,
body_length INT,
body_source TEXT, -- 'smime' | 'html' | 'text' | 'preview' | 'empty'
tsv tsvector GENERATED ALWAYS AS (
to_tsvector('soubory'::regconfig,
left(
coalesce(subject, '') || ' ' ||
coalesce(sender_email, '') || ' ' ||
coalesce(sender_name, '') || ' ' ||
coalesce(to_addrs, '') || ' ' ||
coalesce(cc_addrs, '') || ' ' ||
coalesce(attachments_summary, '') || ' ' ||
coalesce(body, ''),
800000)
)
) STORED,
extracted_at TIMESTAMPTZ DEFAULT now(),
extractor_version TEXT,
ok BOOLEAN,
error TEXT,
UNIQUE (mailbox, message_id)
);
CREATE INDEX IF NOT EXISTS emails_tsv_gin ON emails USING gin(tsv);
CREATE INDEX IF NOT EXISTS emails_subject_trgm ON emails USING gin(subject gin_trgm_ops);
CREATE INDEX IF NOT EXISTS emails_sender_email_idx ON emails(sender_email);
CREATE INDEX IF NOT EXISTS emails_mailbox_idx ON emails(mailbox);
CREATE INDEX IF NOT EXISTS emails_received_idx ON emails(received_at DESC);
CREATE INDEX IF NOT EXISTS emails_conv_idx ON emails(conversation_id);
"""
# --- HELPERS ----------------------------------------------------------------
_CTRL_RX = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_WS_RX = re.compile(r"[ \t]+")
_NL_RX = re.compile(r"\n{3,}")
# Lone surrogates (\ud800-\udfff) are invalid in UTF-8 -> psycopg raises
# UnicodeEncodeError ("surrogates not allowed") on write and fails the whole batch.
# They come from badly decoded bodies (e.g. some JNJ .msg files). Replace them with U+FFFD.
_SURROGATE_RX = re.compile(r"[\ud800-\udfff]")


def _clean_for_pg(s: str) -> str:
    if not s:
        return ""
    s = _CTRL_RX.sub("", s)
    if _SURROGATE_RX.search(s):
        # Replace (not drop) lone surrogates, as the comment above and the changelog state.
        s = _SURROGATE_RX.sub("\ufffd", s)
    return s


def _truncate(s: str) -> str:
    s = _clean_for_pg(s or "")
    if not s:
        return ""
    b = s.encode("utf-8", errors="replace")
    if len(b) <= MAX_TEXT_BYTES:
        return s
    # Cut on a byte boundary; errors="ignore" drops a split trailing character.
    return b[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")


def html_to_text(html: str) -> str:
    if not html:
        return ""
    try:
        soup = BeautifulSoup(html, "lxml")
    except Exception:
        # Fall back to the stdlib parser when lxml is unavailable or fails.
        soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "head"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    lines = [_WS_RX.sub(" ", ln).strip() for ln in text.split("\n")]
    text = "\n".join(ln for ln in lines if ln)
    text = _NL_RX.sub("\n\n", text)
    return text


def fmt_recipients(recipients: list, kind: str) -> str:
    if not recipients:
        return ""
    out = []
    for r in recipients:
        if not isinstance(r, dict):
            continue
        if r.get("type") != kind:
            continue
        name = (r.get("name") or "").strip()
        email = (r.get("email") or "").strip()
        if name and email:
            out.append(f"{name} <{email}>")
        elif email:
            out.append(email)
        elif name:
            out.append(name)
    return "; ".join(out)


def fmt_attachments(attachments: list) -> str:
    if not attachments:
        return ""
    out = []
    for a in attachments[:20]:
        if not isinstance(a, dict):
            continue
        name = a.get("name") or a.get("filename") or ""
        if name:
            out.append(name)
    return " | ".join(out)


def _short(s, n=60):
    if not s:
        return ""
    s = str(s).replace("\n", " ").strip()
    return s if len(s) <= n else s[:n] + "..."


def _now() -> datetime:
    return datetime.now(tz=timezone.utc)


def _aware_utc(dt: Optional[datetime]) -> Optional[datetime]:
    """Normalization: PG TIMESTAMPTZ -> tz-aware UTC; Mongo datetime -> naive (UTC).
    Returns a tz-aware UTC datetime or None."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# --- MAIN LOOP --------------------------------------------------------------
def process_mailbox(pg: psycopg.Connection, mongo_coll, mailbox: str,
                    limit: Optional[int] = None,
                    index_reset: bool = False) -> dict:
    # --index-reset: delete everything for this mailbox in PG
    if index_reset:
        with pg.cursor() as cur:
            cur.execute("DELETE FROM emails WHERE mailbox = %s", (mailbox,))
            deleted = cur.rowcount
        pg.commit()
        print(f"[{mailbox}] --index-reset: smazano {deleted} radku v PG")

    # existing rows in PG (fast incremental lookup)
    # tuple = (extractor_version, ok, body_source)
    with pg.cursor() as cur:
        cur.execute(
            "SELECT message_id, extractor_version, ok, body_source "
            "FROM emails WHERE mailbox = %s",
            (mailbox,),
        )
        existing = {row[0]: (row[1], row[2], row[3]) for row in cur.fetchall()}

    mongo_total = mongo_coll.estimated_document_count()
    pg_total = len(existing)
    pg_uptodate = sum(1 for v in existing.values()
                      if v[0] == EXTRACTOR_VERSION and v[1])
    to_process_estimate = mongo_total - pg_uptodate
    print(f"\n========== {mailbox} ==========")
    print(f" v Mongu: {mongo_total}")
    print(f" v PG: {pg_total} (z toho ext_v={EXTRACTOR_VERSION} & ok=true: {pg_uptodate})")
    print(f" k zpracovani: ~{to_process_estimate}{' (limit=' + str(limit) + ')' if limit else ''}")
    if to_process_estimate <= 0 and not index_reset and not limit:
        print(" Nic noveho ke zpracovani.")
        return {"mailbox": mailbox, "processed": 0, "ok": 0, "errors": 0,
                "skipped": pg_uptodate, "empty_body": 0}

    proj = {
        "_id": 1, "graph_id": 1, "conversation_id": 1, "folder_path": 1,
        "subject": 1, "sender": 1, "recipients": 1,
        "sent_at": 1, "received_at": 1, "modified_at": 1,
        "is_read": 1, "is_draft": 1,
        "has_attachments": 1, "attachment_count": 1, "attachments": 1,
        "body_html": 1, "body_text": 1, "body_preview": 1,
        "smime_unwrapped": 1, "smime_body_text": 1, "smime_body_html": 1,
        "smime_subject": 1, "smime_inner_attachments": 1,
    }
    cursor = mongo_coll.find({}, proj, no_cursor_timeout=True)
    if limit:
        cursor = cursor.limit(limit)

    processed = ok = errors = skipped = empty_body = 0
    queue: list[dict] = []
    n = 0
    try:
        for doc in cursor:
            n += 1
            msg_id = doc.get("_id") or ""
            prev = existing.get(msg_id)  # (extractor_version, ok, body_source)
            mongo_mtime = doc.get("modified_at")
            # Skip when PG has the same extractor version and ok=true.
            # Exception: smime_unwrapped is set in Mongo but PG body_source != 'smime'
            # -> unwrap_smime added the unwrapped text after the enrich run -> re-enrich.
            if prev and prev[0] == EXTRACTOR_VERSION and prev[1]:
                needs_smime_reindex = (
                    bool(doc.get("smime_unwrapped"))
                    and prev[2] != "smime"
                )
                if not needs_smime_reindex:
                    skipped += 1
                    continue

            sender = doc.get("sender") or {}
            recipients = doc.get("recipients") or []
            attachments = doc.get("attachments") or []
            inner = doc.get("smime_inner_attachments") or []
            if inner:
                attachments = list(attachments) + [
                    {"filename": (a.get("filename") or "") + " [smime]"}
                    for a in inner if a.get("filename")
                ]

            row = {
                "mailbox": mailbox,
                "message_id": msg_id,
                "graph_id": doc.get("graph_id"),
                "conversation_id": doc.get("conversation_id"),
                "folder_path": doc.get("folder_path"),
                "subject": doc.get("subject") or "",
                "sender_email": sender.get("email"),
                "sender_name": sender.get("name"),
                "to_addrs": fmt_recipients(recipients, "to"),
                "cc_addrs": fmt_recipients(recipients, "cc"),
                "bcc_addrs": fmt_recipients(recipients, "bcc"),
                # All timestamps from Mongo are naive but interpreted as UTC.
                # Tag them tz-aware so that PG TIMESTAMPTZ stores the correct UTC
                # value and does not apply the session timezone offset.
                "sent_at": _aware_utc(doc.get("sent_at")),
                "received_at": _aware_utc(doc.get("received_at")),
                "modified_at": _aware_utc(mongo_mtime),
                "is_read": doc.get("is_read"),
                "is_draft": doc.get("is_draft"),
                "has_attachments": doc.get("has_attachments"),
                "attachment_count": doc.get("attachment_count"),
                "attachments_summary": fmt_attachments(attachments),
                "body": None,
                "body_length": 0,
                "body_source": "empty",
                "extracted_at": _now(),
                "extractor_version": EXTRACTOR_VERSION,
                "ok": False,
                "error": None,
            }
            status = "OK "
            detail = ""
            try:
                # Fallback order: smime -> html -> text -> preview -> empty
                text = ""
                if doc.get("smime_unwrapped"):
                    s_text = doc.get("smime_body_text") or ""
                    s_html = doc.get("smime_body_html") or ""
                    s_html_text = html_to_text(s_html) if s_html else ""
                    combined = "\n\n".join(p for p in (s_text, s_html_text) if p)
                    s_subject = doc.get("smime_subject") or ""
                    if s_subject:
                        combined = f"Subject: {s_subject}\n\n{combined}"
                    if combined:
                        text = combined
                        row["body_source"] = "smime"
                if not text:
                    html = doc.get("body_html") or ""
                    h_text = html_to_text(html) if html else ""
                    if h_text:
                        text = h_text
                        row["body_source"] = "html"
                if not text:
                    plain = doc.get("body_text") or ""
                    if plain:
                        text = plain
                        row["body_source"] = "text"
                if not text:
                    preview = doc.get("body_preview") or ""
                    if preview:
                        text = preview
                        row["body_source"] = "preview"
                if not text:
                    row["body_source"] = "empty"
                    empty_body += 1
                body = _truncate(text)
                row["body"] = body if body else None
                row["body_length"] = len(body)
                row["ok"] = True
                ok += 1
                detail = f"{len(body)} znaku {_short(body, 60)!r}"
            except Exception as e:
                row["error"] = f"{type(e).__name__}: {e}"[:500]
                status = "ERR"
                detail = row["error"][:80]
                errors += 1

            queue.append(row)
            processed += 1
            if processed % 200 == 0 or processed == 1:
                subj = _short(row["subject"], 50)
                print(f" [{n:>6}|p={processed:>5}] {status} {row['body_source']:<7} "
                      f"{row['body_length']:>7}ch | {subj}", flush=True)
            if len(queue) >= BATCH_SIZE:
                _flush(pg, queue)
                queue.clear()
    finally:
        cursor.close()

    if queue:
        _flush(pg, queue)
    return {"mailbox": mailbox, "processed": processed, "ok": ok,
            "errors": errors, "skipped": skipped, "empty_body": empty_body}


UPSERT_SQL = """
INSERT INTO emails
(mailbox, message_id, graph_id, conversation_id, folder_path,
subject, sender_email, sender_name, to_addrs, cc_addrs, bcc_addrs,
sent_at, received_at, modified_at, is_read, is_draft,
has_attachments, attachment_count, attachments_summary,
body, body_length, body_source,
extracted_at, extractor_version, ok, error)
VALUES
(%(mailbox)s, %(message_id)s, %(graph_id)s, %(conversation_id)s, %(folder_path)s,
%(subject)s, %(sender_email)s, %(sender_name)s, %(to_addrs)s, %(cc_addrs)s, %(bcc_addrs)s,
%(sent_at)s, %(received_at)s, %(modified_at)s, %(is_read)s, %(is_draft)s,
%(has_attachments)s, %(attachment_count)s, %(attachments_summary)s,
%(body)s, %(body_length)s, %(body_source)s,
%(extracted_at)s, %(extractor_version)s, %(ok)s, %(error)s)
ON CONFLICT (mailbox, message_id) DO UPDATE SET
graph_id = EXCLUDED.graph_id,
conversation_id = EXCLUDED.conversation_id,
folder_path = EXCLUDED.folder_path,
subject = EXCLUDED.subject,
sender_email = EXCLUDED.sender_email,
sender_name = EXCLUDED.sender_name,
to_addrs = EXCLUDED.to_addrs,
cc_addrs = EXCLUDED.cc_addrs,
bcc_addrs = EXCLUDED.bcc_addrs,
sent_at = EXCLUDED.sent_at,
received_at = EXCLUDED.received_at,
modified_at = EXCLUDED.modified_at,
is_read = EXCLUDED.is_read,
is_draft = EXCLUDED.is_draft,
has_attachments = EXCLUDED.has_attachments,
attachment_count = EXCLUDED.attachment_count,
attachments_summary = EXCLUDED.attachments_summary,
body = EXCLUDED.body,
body_length = EXCLUDED.body_length,
body_source = EXCLUDED.body_source,
extracted_at = EXCLUDED.extracted_at,
extractor_version = EXCLUDED.extractor_version,
ok = EXCLUDED.ok,
error = EXCLUDED.error
"""
def _flush(pg: psycopg.Connection, rows: list[dict]) -> None:
    # Sanitize every text column before the batch is written.
    for r in rows:
        for k in ("subject", "sender_email", "sender_name", "to_addrs", "cc_addrs",
                  "bcc_addrs", "attachments_summary", "body", "error", "folder_path"):
            if r.get(k):
                r[k] = _clean_for_pg(r[k])
    with pg.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    pg.commit()


def discover_mailboxes(db) -> list[str]:
    out = []
    for name in sorted(db.list_collection_names()):
        if name in NON_MAILBOX_COLLECTIONS:
            continue
        out.append(name)
    return out


def main() -> int:
    ap = argparse.ArgumentParser(description="enrich_fulltext_emails v1.3")
    ap.add_argument("--mailbox", default="",
                    help="Jedna konkretni schranka. Bez argumentu projede vsechny.")
    ap.add_argument("--limit", type=int,
                    help="Limit emailu na schranku (test)")
    ap.add_argument("--index-reset", action="store_true",
                    help="Pred zpracovanim schranky vymaze vsechny jeji emaily z PG "
                         "(force re-extract). Bez --mailbox SMAZE CELY index.")
    args = ap.parse_args()

    t0 = time.time()
    print("=== enrich_fulltext_emails v1.3 ===")
    print(f"Start: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    print("\nPripojuji se k PostgreSQL...")
    pg = psycopg.connect(PG_DSN, connect_timeout=10)
    with pg.cursor() as cur:
        cur.execute(SCHEMA_SQL)
    pg.commit()
    print(" Schema OK.")

    print("Pripojuji se k MongoDB...")
    mongo = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    mongo.admin.command("ping")
    db = mongo[MONGO_DB]
    print(" MongoDB OK.")

    if args.mailbox:
        mailboxes = [args.mailbox]
    else:
        mailboxes = discover_mailboxes(db)
    print(f"\nSchranky ke zpracovani ({len(mailboxes)}):")
    for mb in mailboxes:
        print(f" - {mb}")
    if args.index_reset and not args.mailbox:
        print(f"\n!!! --index-reset bez --mailbox => SMAZE CELY INDEX ({len(mailboxes)} schranek) !!!")

    results = []
    for mb in mailboxes:
        try:
            results.append(process_mailbox(pg, db[mb], mb,
                                           limit=args.limit,
                                           index_reset=args.index_reset))
        except Exception as e:
            traceback.print_exc()
            print(f" FATAL pri zpracovani {mb}: {e}")
            results.append({"mailbox": mb, "processed": 0, "ok": 0,
                            "errors": 1, "skipped": 0, "empty_body": 0})
    pg.close()

    print("\n" + "=" * 60)
    print("=== SHRNUTI ===")
    grand = {"processed": 0, "ok": 0, "errors": 0, "skipped": 0, "empty_body": 0}
    for r in results:
        print(f" {r['mailbox']:40} processed={r['processed']:>5} ok={r['ok']:>5} "
              f"errors={r['errors']:>3} skipped={r['skipped']:>6} empty={r['empty_body']:>4}")
        for k in grand:
            grand[k] += r.get(k, 0)
    print(f" {'TOTAL':40} processed={grand['processed']:>5} ok={grand['ok']:>5} "
          f"errors={grand['errors']:>3} skipped={grand['skipped']:>6} empty={grand['empty_body']:>4}")
    print(f"\nCelkem trvalo: {time.time() - t0:.1f} s")
    print(f"Konec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    # exit code: 0 only when all mailboxes were processed without errors
    return 1 if grand["errors"] > 0 else 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("\nPreruseno uzivatelem")
    except Exception:
        traceback.print_exc()
        sys.exit(1)
@@ -0,0 +1,587 @@
"""
==============================================================================
Script:  enrich_fulltext_emails_v1.4.py
Version: 1.4
Date:    2026-06-10
Author:  vladimir.buzalka

Changes in v1.4 (2026-06-10):
  - Bugfix: NON_MAILBOX_COLLECTIONS extended with "jnj_messages" and
    "jnj_sync_state" (helper collections for JNJ folder tracking). Previously
    discover_mailboxes treated them as mailboxes (different document schema) ->
    errors=1 -> the whole step 5 ended with FAIL(1) on every pipeline run.

Description:
  Extracts the full text of emails stored in MongoDB (db: emaily) and stores it
  in PostgreSQL (db: MongoEmaily, table: emails) with a GIN tsvector index.
  Emails are NOT downloaded again; the bodies are already in Mongo from
  parse_emails_graph_v1.4 (and refetch_text_bodies_v1.0 for old plain-text emails).
  This script only picks the first available body and sends the text to PG
  for full-text search.

Changes in v1.3.1 (2026-06-09):
  - Bugfix: _clean_for_pg replaces lone surrogates (\\ud800-\\udfff) with U+FFFD.
    Previously a single mail with surrogates (e.g. a JNJ .msg) failed the whole
    batch and step 5 ended with FAIL. EXTRACTOR_VERSION stays 1.2 (the fallback
    logic did not change).

Changes in v1.3 vs v1.2:
  - Bugfix: NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
    (sync_state was added by the delta sync; v1.2 treated it as a mailbox).
  - --index-reset: deletes all emails of the mailbox from PG before processing
    (force re-extract; use it after bumping EXTRACTOR_VERSION or for a clean start).
  - Improved per-mailbox header: shows the counts in Mongo, in PG and to process.

Changes in v1.2 vs v1.1:
  - S/MIME emails: if unwrap_smime_v1.0 stored smime_body_text/smime_body_html,
    they are PREFERRED over the meaningless wrapper body.
  - body_source: new value "smime".
  - EXTRACTOR_VERSION=1.2 -> all existing emails in PG are re-parsed.

Changes in v1.1 vs v1.0:
  - Fallback order extended with body_text.
  - body_source supports the new value "text" (full plain-text body, max 2 MB).

Source:
  MongoDB 192.168.1.76 db=emaily collection=<mailbox>
  (except NON_MAILBOX_COLLECTIONS)

Target:
  PostgreSQL 192.168.1.76 db=MongoEmaily table=emails
  tsvector config 'soubory' (shared: simple + unaccent)

Incremental processing:
  If (mailbox, message_id) already exists in PG with the current
  extractor_version and ok=true -> skip. modified_at from Mongo is stored in PG
  but is not compared. Exception: a document with smime_unwrapped whose PG row
  has body_source != 'smime' is re-extracted. When the extractor version
  changes, everything is re-parsed. --index-reset bypasses this and deletes
  the PG rows before the run.

Usage:
  python enrich_fulltext_emails_v1.4.py                                # all mailboxes
  python enrich_fulltext_emails_v1.4.py --mailbox ordinace@buzalkova.cz
  python enrich_fulltext_emails_v1.4.py --limit 500                    # test
  python enrich_fulltext_emails_v1.4.py --mailbox X --index-reset      # deletes the mailbox from PG and re-extracts everything
  python enrich_fulltext_emails_v1.4.py --index-reset                  # deletes the WHOLE index and rebuilds it (SLOW!)
==============================================================================
"""
from __future__ import annotations
import argparse
import re
import sys
import time
import traceback
from datetime import datetime, timezone
from typing import Optional
import psycopg
from bs4 import BeautifulSoup
from pymongo import MongoClient
# --- konfigurace ------------------------------------------------------------
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
PG_DSN = ("host=192.168.1.76 port=5432 dbname=MongoEmaily "
"user=vladimir.buzalka password=Vlado7309208104++")
EXTRACTOR_VERSION = "1.2"  # DO NOT CHANGE unless you change the fallback logic!
MAX_TEXT_BYTES = 5 * 1024 * 1024  # plain text max 5 MB
# Collections in `emaily` that are NOT mailboxes (not processed)
# (jnj_messages + jnj_sync_state = helper collections for JNJ folder tracking)
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state",
                           "jnj_messages", "jnj_sync_state"}
BATCH_SIZE = 100
# --- SCHEMA -----------------------------------------------------------------
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS unaccent;
CREATE EXTENSION IF NOT EXISTS pg_trgm;
DO $$
BEGIN
IF NOT EXISTS (SELECT 1 FROM pg_ts_config WHERE cfgname = 'soubory') THEN
CREATE TEXT SEARCH CONFIGURATION soubory ( COPY = simple );
ALTER TEXT SEARCH CONFIGURATION soubory
ALTER MAPPING FOR hword, hword_part, word
WITH unaccent, simple;
END IF;
END$$;
CREATE TABLE IF NOT EXISTS emails (
id BIGSERIAL PRIMARY KEY,
mailbox TEXT NOT NULL,
message_id TEXT NOT NULL,
graph_id TEXT,
conversation_id TEXT,
folder_path TEXT,
subject TEXT,
sender_email TEXT,
sender_name TEXT,
to_addrs TEXT,
cc_addrs TEXT,
bcc_addrs TEXT,
sent_at TIMESTAMPTZ,
received_at TIMESTAMPTZ,
modified_at TIMESTAMPTZ,
is_read BOOLEAN,
is_draft BOOLEAN,
has_attachments BOOLEAN,
attachment_count INT,
attachments_summary TEXT,
body TEXT,
body_length INT,
body_source TEXT, -- 'smime' | 'html' | 'text' | 'preview' | 'empty'
tsv tsvector GENERATED ALWAYS AS (
to_tsvector('soubory'::regconfig,
left(
coalesce(subject, '') || ' ' ||
coalesce(sender_email, '') || ' ' ||
coalesce(sender_name, '') || ' ' ||
coalesce(to_addrs, '') || ' ' ||
coalesce(cc_addrs, '') || ' ' ||
coalesce(attachments_summary, '') || ' ' ||
coalesce(body, ''),
800000)
)
) STORED,
extracted_at TIMESTAMPTZ DEFAULT now(),
extractor_version TEXT,
ok BOOLEAN,
error TEXT,
UNIQUE (mailbox, message_id)
);
CREATE INDEX IF NOT EXISTS emails_tsv_gin ON emails USING gin(tsv);
CREATE INDEX IF NOT EXISTS emails_subject_trgm ON emails USING gin(subject gin_trgm_ops);
CREATE INDEX IF NOT EXISTS emails_sender_email_idx ON emails(sender_email);
CREATE INDEX IF NOT EXISTS emails_mailbox_idx ON emails(mailbox);
CREATE INDEX IF NOT EXISTS emails_received_idx ON emails(received_at DESC);
CREATE INDEX IF NOT EXISTS emails_conv_idx ON emails(conversation_id);
"""
# --- HELPERS ----------------------------------------------------------------
_CTRL_RX = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
_WS_RX = re.compile(r"[ \t]+")
_NL_RX = re.compile(r"\n{3,}")
# Lone surrogates (\ud800-\udfff) are invalid in UTF-8 -> psycopg raises
# UnicodeEncodeError ("surrogates not allowed") on write and fails the whole batch.
# They come from badly decoded bodies (e.g. some JNJ .msg files). Replace them with U+FFFD.
_SURROGATE_RX = re.compile(r"[\ud800-\udfff]")


def _clean_for_pg(s: str) -> str:
    if not s:
        return ""
    s = _CTRL_RX.sub("", s)
    if _SURROGATE_RX.search(s):
        # Replace (not drop) lone surrogates, as the comment above and the changelog state.
        s = _SURROGATE_RX.sub("\ufffd", s)
    return s


def _truncate(s: str) -> str:
    s = _clean_for_pg(s or "")
    if not s:
        return ""
    b = s.encode("utf-8", errors="replace")
    if len(b) <= MAX_TEXT_BYTES:
        return s
    # Cut on a byte boundary; errors="ignore" drops a split trailing character.
    return b[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")


def html_to_text(html: str) -> str:
    if not html:
        return ""
    try:
        soup = BeautifulSoup(html, "lxml")
    except Exception:
        # Fall back to the stdlib parser when lxml is unavailable or fails.
        soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "head"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    lines = [_WS_RX.sub(" ", ln).strip() for ln in text.split("\n")]
    text = "\n".join(ln for ln in lines if ln)
    text = _NL_RX.sub("\n\n", text)
    return text


def fmt_recipients(recipients: list, kind: str) -> str:
    if not recipients:
        return ""
    out = []
    for r in recipients:
        if not isinstance(r, dict):
            continue
        if r.get("type") != kind:
            continue
        name = (r.get("name") or "").strip()
        email = (r.get("email") or "").strip()
        if name and email:
            out.append(f"{name} <{email}>")
        elif email:
            out.append(email)
        elif name:
            out.append(name)
    return "; ".join(out)


def fmt_attachments(attachments: list) -> str:
    if not attachments:
        return ""
    out = []
    for a in attachments[:20]:
        if not isinstance(a, dict):
            continue
        name = a.get("name") or a.get("filename") or ""
        if name:
            out.append(name)
    return " | ".join(out)


def _short(s, n=60):
    if not s:
        return ""
    s = str(s).replace("\n", " ").strip()
    return s if len(s) <= n else s[:n] + "..."


def _now() -> datetime:
    return datetime.now(tz=timezone.utc)


def _aware_utc(dt: Optional[datetime]) -> Optional[datetime]:
    """Normalization: PG TIMESTAMPTZ -> tz-aware UTC; Mongo datetime -> naive (UTC).
    Returns a tz-aware UTC datetime or None."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)


# --- MAIN LOOP --------------------------------------------------------------
def process_mailbox(pg: psycopg.Connection, mongo_coll, mailbox: str,
                    limit: Optional[int] = None,
                    index_reset: bool = False) -> dict:
    # --index-reset: delete everything for this mailbox in PG
    if index_reset:
        with pg.cursor() as cur:
            cur.execute("DELETE FROM emails WHERE mailbox = %s", (mailbox,))
            deleted = cur.rowcount
        pg.commit()
        print(f"[{mailbox}] --index-reset: smazano {deleted} radku v PG")

    # existing rows in PG (fast incremental lookup)
    # tuple = (extractor_version, ok, body_source)
    with pg.cursor() as cur:
        cur.execute(
            "SELECT message_id, extractor_version, ok, body_source "
            "FROM emails WHERE mailbox = %s",
            (mailbox,),
        )
        existing = {row[0]: (row[1], row[2], row[3]) for row in cur.fetchall()}

    mongo_total = mongo_coll.estimated_document_count()
    pg_total = len(existing)
    pg_uptodate = sum(1 for v in existing.values()
                      if v[0] == EXTRACTOR_VERSION and v[1])
    to_process_estimate = mongo_total - pg_uptodate
    print(f"\n========== {mailbox} ==========")
    print(f" v Mongu: {mongo_total}")
    print(f" v PG: {pg_total} (z toho ext_v={EXTRACTOR_VERSION} & ok=true: {pg_uptodate})")
    print(f" k zpracovani: ~{to_process_estimate}{' (limit=' + str(limit) + ')' if limit else ''}")
    if to_process_estimate <= 0 and not index_reset and not limit:
        print(" Nic noveho ke zpracovani.")
        return {"mailbox": mailbox, "processed": 0, "ok": 0, "errors": 0,
                "skipped": pg_uptodate, "empty_body": 0}

    proj = {
        "_id": 1, "graph_id": 1, "conversation_id": 1, "folder_path": 1,
        "subject": 1, "sender": 1, "recipients": 1,
        "sent_at": 1, "received_at": 1, "modified_at": 1,
        "is_read": 1, "is_draft": 1,
        "has_attachments": 1, "attachment_count": 1, "attachments": 1,
        "body_html": 1, "body_text": 1, "body_preview": 1,
        "smime_unwrapped": 1, "smime_body_text": 1, "smime_body_html": 1,
        "smime_subject": 1, "smime_inner_attachments": 1,
    }
    cursor = mongo_coll.find({}, proj, no_cursor_timeout=True)
    if limit:
        cursor = cursor.limit(limit)

    processed = ok = errors = skipped = empty_body = 0
    queue: list[dict] = []
    n = 0
    try:
        for doc in cursor:
            n += 1
            msg_id = doc.get("_id") or ""
            prev = existing.get(msg_id)  # (extractor_version, ok, body_source)
            mongo_mtime = doc.get("modified_at")
            # Skip when PG has the same extractor version and ok=true.
            # Exception: smime_unwrapped is set in Mongo but PG body_source != 'smime'
            # -> unwrap_smime added the unwrapped text after the enrich run -> re-enrich.
            if prev and prev[0] == EXTRACTOR_VERSION and prev[1]:
                needs_smime_reindex = (
                    bool(doc.get("smime_unwrapped"))
                    and prev[2] != "smime"
                )
                if not needs_smime_reindex:
                    skipped += 1
                    continue

            sender = doc.get("sender") or {}
            recipients = doc.get("recipients") or []
            attachments = doc.get("attachments") or []
            inner = doc.get("smime_inner_attachments") or []
            if inner:
                attachments = list(attachments) + [
                    {"filename": (a.get("filename") or "") + " [smime]"}
                    for a in inner if a.get("filename")
                ]

            row = {
                "mailbox": mailbox,
                "message_id": msg_id,
                "graph_id": doc.get("graph_id"),
                "conversation_id": doc.get("conversation_id"),
                "folder_path": doc.get("folder_path"),
                "subject": doc.get("subject") or "",
                "sender_email": sender.get("email"),
                "sender_name": sender.get("name"),
                "to_addrs": fmt_recipients(recipients, "to"),
                "cc_addrs": fmt_recipients(recipients, "cc"),
                "bcc_addrs": fmt_recipients(recipients, "bcc"),
                # All timestamps from Mongo are naive but interpreted as UTC.
                # Tag them tz-aware so that PG TIMESTAMPTZ stores the correct UTC
                # value and does not apply the session timezone offset.
                "sent_at": _aware_utc(doc.get("sent_at")),
                "received_at": _aware_utc(doc.get("received_at")),
                "modified_at": _aware_utc(mongo_mtime),
                "is_read": doc.get("is_read"),
                "is_draft": doc.get("is_draft"),
                "has_attachments": doc.get("has_attachments"),
                "attachment_count": doc.get("attachment_count"),
                "attachments_summary": fmt_attachments(attachments),
                "body": None,
                "body_length": 0,
                "body_source": "empty",
                "extracted_at": _now(),
                "extractor_version": EXTRACTOR_VERSION,
                "ok": False,
                "error": None,
            }
            status = "OK "
            detail = ""
            try:
                # Fallback order: smime -> html -> text -> preview -> empty
                text = ""
                if doc.get("smime_unwrapped"):
                    s_text = doc.get("smime_body_text") or ""
                    s_html = doc.get("smime_body_html") or ""
                    s_html_text = html_to_text(s_html) if s_html else ""
                    combined = "\n\n".join(p for p in (s_text, s_html_text) if p)
                    s_subject = doc.get("smime_subject") or ""
                    if s_subject:
                        combined = f"Subject: {s_subject}\n\n{combined}"
                    if combined:
                        text = combined
                        row["body_source"] = "smime"
                if not text:
                    html = doc.get("body_html") or ""
                    h_text = html_to_text(html) if html else ""
                    if h_text:
                        text = h_text
                        row["body_source"] = "html"
                if not text:
                    plain = doc.get("body_text") or ""
                    if plain:
                        text = plain
                        row["body_source"] = "text"
                if not text:
                    preview = doc.get("body_preview") or ""
                    if preview:
                        text = preview
                        row["body_source"] = "preview"
                if not text:
                    row["body_source"] = "empty"
                    empty_body += 1
                body = _truncate(text)
                row["body"] = body if body else None
                row["body_length"] = len(body)
                row["ok"] = True
                ok += 1
                detail = f"{len(body)} znaku {_short(body, 60)!r}"
            except Exception as e:
                row["error"] = f"{type(e).__name__}: {e}"[:500]
                status = "ERR"
                detail = row["error"][:80]
                errors += 1

            queue.append(row)
            processed += 1
            if processed % 200 == 0 or processed == 1:
                subj = _short(row["subject"], 50)
                print(f" [{n:>6}|p={processed:>5}] {status} {row['body_source']:<7} "
                      f"{row['body_length']:>7}ch | {subj}", flush=True)
            if len(queue) >= BATCH_SIZE:
                _flush(pg, queue)
                queue.clear()
    finally:
        cursor.close()

    if queue:
        _flush(pg, queue)
    return {"mailbox": mailbox, "processed": processed, "ok": ok,
            "errors": errors, "skipped": skipped, "empty_body": empty_body}


UPSERT_SQL = """
INSERT INTO emails
(mailbox, message_id, graph_id, conversation_id, folder_path,
subject, sender_email, sender_name, to_addrs, cc_addrs, bcc_addrs,
sent_at, received_at, modified_at, is_read, is_draft,
has_attachments, attachment_count, attachments_summary,
body, body_length, body_source,
extracted_at, extractor_version, ok, error)
VALUES
(%(mailbox)s, %(message_id)s, %(graph_id)s, %(conversation_id)s, %(folder_path)s,
%(subject)s, %(sender_email)s, %(sender_name)s, %(to_addrs)s, %(cc_addrs)s, %(bcc_addrs)s,
%(sent_at)s, %(received_at)s, %(modified_at)s, %(is_read)s, %(is_draft)s,
%(has_attachments)s, %(attachment_count)s, %(attachments_summary)s,
%(body)s, %(body_length)s, %(body_source)s,
%(extracted_at)s, %(extractor_version)s, %(ok)s, %(error)s)
ON CONFLICT (mailbox, message_id) DO UPDATE SET
graph_id = EXCLUDED.graph_id,
conversation_id = EXCLUDED.conversation_id,
folder_path = EXCLUDED.folder_path,
subject = EXCLUDED.subject,
sender_email = EXCLUDED.sender_email,
sender_name = EXCLUDED.sender_name,
to_addrs = EXCLUDED.to_addrs,
cc_addrs = EXCLUDED.cc_addrs,
bcc_addrs = EXCLUDED.bcc_addrs,
sent_at = EXCLUDED.sent_at,
received_at = EXCLUDED.received_at,
modified_at = EXCLUDED.modified_at,
is_read = EXCLUDED.is_read,
is_draft = EXCLUDED.is_draft,
has_attachments = EXCLUDED.has_attachments,
attachment_count = EXCLUDED.attachment_count,
attachments_summary = EXCLUDED.attachments_summary,
body = EXCLUDED.body,
body_length = EXCLUDED.body_length,
body_source = EXCLUDED.body_source,
extracted_at = EXCLUDED.extracted_at,
extractor_version = EXCLUDED.extractor_version,
ok = EXCLUDED.ok,
error = EXCLUDED.error
"""
def _flush(pg: psycopg.Connection, rows: list[dict]) -> None:
    # Sanitize every text column before the batch is written.
    for r in rows:
        for k in ("subject", "sender_email", "sender_name", "to_addrs", "cc_addrs",
                  "bcc_addrs", "attachments_summary", "body", "error", "folder_path"):
            if r.get(k):
                r[k] = _clean_for_pg(r[k])
    with pg.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    pg.commit()


def discover_mailboxes(db) -> list[str]:
    out = []
    for name in sorted(db.list_collection_names()):
        if name in NON_MAILBOX_COLLECTIONS:
            continue
        out.append(name)
    return out


def main() -> int:
    ap = argparse.ArgumentParser(description="enrich_fulltext_emails v1.4")
    ap.add_argument("--mailbox", default="",
                    help="Jedna konkretni schranka. Bez argumentu projede vsechny.")
    ap.add_argument("--limit", type=int,
                    help="Limit emailu na schranku (test)")
    ap.add_argument("--index-reset", action="store_true",
                    help="Pred zpracovanim schranky vymaze vsechny jeji emaily z PG "
                         "(force re-extract). Bez --mailbox SMAZE CELY index.")
    args = ap.parse_args()

    t0 = time.time()
    print("=== enrich_fulltext_emails v1.4 ===")
    print(f"Start: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    print("\nPripojuji se k PostgreSQL...")
    pg = psycopg.connect(PG_DSN, connect_timeout=10)
    with pg.cursor() as cur:
        cur.execute(SCHEMA_SQL)
    pg.commit()
    print(" Schema OK.")

    print("Pripojuji se k MongoDB...")
    mongo = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    mongo.admin.command("ping")
    db = mongo[MONGO_DB]
    print(" MongoDB OK.")

    if args.mailbox:
        mailboxes = [args.mailbox]
    else:
        mailboxes = discover_mailboxes(db)
    print(f"\nSchranky ke zpracovani ({len(mailboxes)}):")
    for mb in mailboxes:
        print(f" - {mb}")
    if args.index_reset and not args.mailbox:
        print(f"\n!!! --index-reset bez --mailbox => SMAZE CELY INDEX ({len(mailboxes)} schranek) !!!")

    results = []
    for mb in mailboxes:
        try:
            results.append(process_mailbox(pg, db[mb], mb,
                                           limit=args.limit,
                                           index_reset=args.index_reset))
        except Exception as e:
            traceback.print_exc()
            print(f" FATAL pri zpracovani {mb}: {e}")
            results.append({"mailbox": mb, "processed": 0, "ok": 0,
                            "errors": 1, "skipped": 0, "empty_body": 0})
    pg.close()

    print("\n" + "=" * 60)
    print("=== SHRNUTI ===")
    grand = {"processed": 0, "ok": 0, "errors": 0, "skipped": 0, "empty_body": 0}
    for r in results:
        print(f" {r['mailbox']:40} processed={r['processed']:>5} ok={r['ok']:>5} "
              f"errors={r['errors']:>3} skipped={r['skipped']:>6} empty={r['empty_body']:>4}")
        for k in grand:
            grand[k] += r.get(k, 0)
    print(f" {'TOTAL':40} processed={grand['processed']:>5} ok={grand['ok']:>5} "
          f"errors={grand['errors']:>3} skipped={grand['skipped']:>6} empty={grand['empty_body']:>4}")
    print(f"\nCelkem trvalo: {time.time() - t0:.1f} s")
    print(f"Konec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    # exit code: 0 only when all mailboxes were processed without errors
    return 1 if grand["errors"] > 0 else 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("\nPreruseno uzivatelem")
    except Exception:
        traceback.print_exc()
        sys.exit(1)
@@ -0,0 +1,289 @@
# parse_emails_tower_v1.3
## Quick start
**First run:**
```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.3.py > /scripts/parse_emails_tower.log 2>&1"
```
**Resume after an interruption (skips files that are already imported):**
```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.3.py --skip-existing > /scripts/parse_emails_tower.log 2>&1"
```
---
## Import status
**Follow the progress (live log):**
```bash
docker exec -it python-runner tail -f /scripts/parse_emails_tower.log
```
**Number of emails in MongoDB:**
```bash
docker exec -it python-runner python -c \
"from pymongo import MongoClient; c=MongoClient('mongodb://192.168.1.76:27017'); print(c['emaily']['vbuzalka@its.jnj.com'].count_documents({}))"
```
---
**Název:** parse_emails_tower_v1.3.py
**Verze:** 1.3
**Datum:** 2026-06-08
**Autor:** vladimir.buzalka
---
## Purpose
Imports all `.msg` files into MongoDB. For each file it extracts **every available property**, much like EXIF data for photos.
- **DB:** `emaily`
- **Collection:** `vbuzalka@its.jnj.com`
- `_id` = Internet Message-ID (or `filename:<stem>` as a fallback)
- Safe to interrupt and rerun: documents are upserted by `_id`
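The `_id` rule can be sketched in a few lines. This is a minimal illustration rather than the script's own code: `document_id` is a hypothetical helper name, and it assumes the Message-ID has already been read from the message as an optional string.

```python
from pathlib import Path
from typing import Optional


def document_id(message_id: Optional[str], msg_path: Path) -> str:
    """Return the MongoDB _id: the Message-ID, or 'filename:<stem>' when it is missing."""
    if message_id and message_id.strip():
        return message_id.strip()
    # Fallback: the file name without its .msg extension
    return f"filename:{msg_path.stem}"


print(document_id(" <abc@example.com> ", Path("7A3F0000.msg")))
print(document_id(None, Path("7A3F0000.msg")))
```

Because the same file always yields the same `_id`, rerunning the import overwrites the existing document instead of creating a duplicate.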
---
## Environment
Runs in the **python-runner** Docker container on **Unraid Tower**.
| Component | Location |
|---|---|
| Container | `python-runner` (Docker on Unraid Tower) |
| .msg files | `/mnt/user/JNJEMAILS` → `/mnt/JNJEMAILS` inside the container |
| Scripts | `/mnt/user/Scripts` → `/scripts` inside the container |
| MongoDB | `192.168.1.76:27017` (external, outside the container) |
---
## Running (from the Unraid terminal)
**Test on 50 emails:**
```bash
docker exec -it python-runner python /scripts/parse_emails_tower_v1.3.py --limit 50 --no-indexes
```
**Full import in the background (log to a file):**
```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.3.py > /scripts/parse_emails_tower.log 2>&1"
```
**Resuming after an interruption:**
```bash
docker exec -d python-runner bash -c \
"python /scripts/parse_emails_tower_v1.3.py --skip-existing > /scripts/parse_emails_tower.log 2>&1"
```
**Watching progress (Ctrl+C stops the tail; the import keeps running):**
```bash
docker exec -it python-runner tail -f /scripts/parse_emails_tower.log
```
### All parameters
| Parameter | Description |
|---|---|
| `--skip-existing` | Loads the list of already imported files from MongoDB and skips them. Use it to resume after an interruption. |
| `--limit N` | Processes only the first N files. Useful for testing. |
| `--no-indexes` | Does not create indexes at the end. Use it when the run may be interrupted partway; create the indexes manually once everything is imported. |
| `--msgs-dir PATH` | Overrides the default path to the .msg files (default: `/mnt/JNJEMAILS`). |
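How `--limit` and `--skip-existing` combine can be sketched as follows. The order (sort, then limit, then skip) follows the script's `main()`; `select_files` is a hypothetical helper name used only for illustration.

```python
from pathlib import Path
from typing import Iterable, List, Set


def select_files(all_files: Iterable[Path], existing: Set[str], limit: int = 0) -> List[Path]:
    """Sort the candidates, apply the limit, then drop files that are already imported."""
    files = sorted(all_files)
    if limit:
        files = files[:limit]  # the limit is applied BEFORE skipping
    return [f for f in files if f.name not in existing]


candidates = [Path("b.msg"), Path("a.msg"), Path("c.msg")]
print(select_files(candidates, existing={"a.msg"}, limit=2))
```

Because the limit is applied first, `--limit 50 --skip-existing` processes nothing once the first 50 files are already in the collection.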
---
## Console output
One line per email:
```
1/69371 OK RE: Protocol deviation CZ10022 jan.novak@its.jnj.com
2/69371 OK UCO3001: Draft FUL pro DD5-CZ10022 monitor@4gclinical.com
3/69371 ERR ? ?
```
Every 500 emails, a separator with the progress so far:
```
────────────────────────────────────────────────────────────────────────────────
Průběh: ok=498 err=2 0.4 msg/s ETA 47h12m
────────────────────────────────────────────────────────────────────────────────
```
A summary at the end:
```
====================================================
Vysledek: ok=69300 | skip=0 | err=71
Celkovy cas: 47h 23m 10s
Dokumentu v kolekci: 69300
```
---
## Data extracted from each .msg
| Field | Description |
|---|---|
| Subject | raw subject and normalized subject (without the RE:/FW: prefix) |
| Sender | email, name, SMTP address |
| To/CC/BCC recipients | structured as `[{type, email, name}]` |
| Received and sent time | UTC |
| Body | plain text + HTML (max 2 MB) |
| Attachments | metadata: name, size, MIME type, inline flag |
| Internet headers | X-Originating-IP, Received, DKIM, X-Mailer, ... |
| MAPI | importance, sensitivity, flag, conversation thread, categories |
| In-Reply-To, References | for thread reconstruction |
| Raw MAPI properties | `{0xXXXX: value}` |
---
## Value codes
| Field | Value | Meaning |
|---|---|---|
| `importance` | 0 | Low |
| | 1 | Normal |
| | 2 | High |
| `sensitivity` | 0 | Normal |
| | 1 | Personal |
| | 2 | Private |
| | 3 | Confidential |
| `flag_status` | 0 | Not flagged |
| | 1 | Completed (MAPI `followupComplete`) |
| | 2 | Flagged for follow-up (MAPI `followupFlagged`) |
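When reading documents back, the numeric `importance` and `sensitivity` codes can be turned into labels with a small lookup. A sketch only: the dictionaries and `describe` are hypothetical names, and the values are taken from the table above.

```python
IMPORTANCE = {0: "low", 1: "normal", 2: "high"}
SENSITIVITY = {0: "normal", 1: "personal", 2: "private", 3: "confidential"}


def describe(doc: dict) -> str:
    """Summarize the importance and sensitivity codes of one document as text."""
    importance = IMPORTANCE.get(doc.get("importance"), "unknown")
    sensitivity = SENSITIVITY.get(doc.get("sensitivity"), "unknown")
    return f"importance={importance}, sensitivity={sensitivity}"


print(describe({"importance": 2, "sensitivity": 3}))
```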
---
## MongoDB indexes
Created automatically at the end of the import (skipped with `--no-indexes`):
| Index | Fields |
|---|---|
| Chronological | `received_at`, `sent_at` |
| Sender | `sender.email` |
| File | `filename` (unique) |
| Conversation | `conversation_topic` |
| Filters | `has_attachments`, `categories`, `importance`, `flag_status` |
| Full-text | `subject` + `body_text` + `to` + `cc` (text index `text_search`) |
---
## Sample queries (MongoDB shell / MCP)
**Emails about UCO3001 with an attachment:**
```javascript
db["vbuzalka@its.jnj.com"].find({
$text: { $search: "UCO3001" },
has_attachments: true
}).sort({ received_at: -1 })
```
**Emails from a specific sender:**
```javascript
db["vbuzalka@its.jnj.com"].find({
"sender.email": /covance/i
}).sort({ received_at: -1 })
```
**An entire conversation thread:**
```javascript
db["vbuzalka@its.jnj.com"].find({
conversation_topic: "Protocol deviation CZ10022"
}).sort({ received_at: 1 })
```
**Statistics by sender (top 20):**
```javascript
db["vbuzalka@its.jnj.com"].aggregate([
{ $group: { _id: "$sender.email", count: { $sum: 1 } } },
{ $sort: { count: -1 } },
{ $limit: 20 }
])
```
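The same queries can be issued from Python. A sketch, assuming `pymongo` and the connection details listed in this README; the two builder functions are hypothetical names, and the commented lines at the end need a reachable MongoDB server.

```python
def with_attachments_about(term: str) -> dict:
    """Filter for a full-text match restricted to documents that have attachments."""
    return {"$text": {"$search": term}, "has_attachments": True}


def top_senders(limit: int = 20) -> list:
    """Aggregation pipeline that counts documents per sender, largest first."""
    return [
        {"$group": {"_id": "$sender.email", "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


print(with_attachments_about("UCO3001"))
print(top_senders(3))

# With a live server (not executed here):
# from pymongo import MongoClient
# col = MongoClient("mongodb://192.168.1.76:27017")["emaily"]["vbuzalka@its.jnj.com"]
# for doc in col.find(with_attachments_about("UCO3001")).sort("received_at", -1):
#     print(doc["subject"])
# for row in col.aggregate(top_senders()):
#     print(row["_id"], row["count"])
```

A `$text` query requires a text index, so it fails on a collection imported with `--no-indexes` until the `text_search` index has been created.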
---
## Error log
Files that failed are logged to a **separate** `parse_emails_tower_errors.log` next to the script (i.e. `/scripts/parse_emails_tower_errors.log` → `\\tower\Scripts\parse_emails_tower_errors.log`). This log is kept apart from the Graph import so the two do not get mixed together:
```
2026-06-08 12:40:33 | open failed [7A3F...0000.msg]: <reason>
2026-06-08 12:41:02 | per-dokument selhal [_id=<...>]: <reason>
```
Stdout (progress) goes to `parse_emails_tower.log`, which is also a separate file.
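After a long run it helps to see which kinds of failure dominate. A sketch that counts error-log lines by kind, assuming the `timestamp | message` layout shown above; `summarize` and the way a "kind" is cut out of the message are illustrative choices, not part of the script.

```python
import re
from collections import Counter

# One log record: "YYYY-MM-DD HH:MM:SS | message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| (.*)$")


def summarize(lines) -> Counter:
    """Count log records by kind (the message text before the first '[', ':' or '(')."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue  # continuation lines of multi-line messages are ignored
        kind = re.split(r"[\[:(]", match.group(2), maxsplit=1)[0].strip()
        counts[kind] += 1
    return counts


sample = [
    "2026-06-08 12:40:33 | open failed [7A3F0000.msg]: reason A",
    "2026-06-08 12:40:35 | open failed [7A3F0001.msg]: reason B",
    "2026-06-08 12:41:02 | per-dokument selhal [_id=<x>]: reason C",
]
print(summarize(sample))
```

To run it against the real file, pass an open file object, for example `summarize(open("/scripts/parse_emails_tower_errors.log", encoding="utf-8"))`.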
---
## Rescuing problematic .msg files (v1.3)
The default `extract_msg` cannot open some `.msg` files and discards the whole file, **even though the email itself is perfectly fine** (it opens in Outlook). Three causes and their fixes:
| Cause | Example | Fix |
|---|---|---|
| Broken attachment without `PR_ATTACH_METHOD` | "Attachment method missing" | `errorBehavior=SUPPRESS_ALL`: skips the broken attachment and loads the rest (body, other attachments) |
| Body declares codepage 1200 (UTF-16), but the bytes are cp1250/gb2312 | `U+FFFD` replacement characters instead of Czech diacritics | raw-OLE read + cascading decode |
| Nested email (Outlook item) | "not an MSG file", `extract_msg` returns empty fields | raw-OLE read of the key MAPI streams |
**How it works:**
1. `open_message()`: cascading open, `normal` → `SUPPRESS_ALL` → `+overrideEncoding` (based on the codepage property).
2. **raw-OLE fallback**: when extract_msg returns empty fields or `U+FFFD`, or had to guess the encoding, the key fields (subject, sender, body, html) are read **directly from the OLE streams** (`__substg1.0_0037`/`0C1A`/`5D01`/`1000`/`1013`) with cascading decoding:
```
utf-8 (strict) → codec from CPID → cp1250 → cp1252 → gb2312 → big5 → latin-1
```
The encoding headers are **not trusted** (they often contradict each other); the first codec that decodes strictly without an error wins. `utf-8 strict` is a strong discriminator, because non-ASCII text in a legacy 8-bit codepage is rarely valid UTF-8.
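The cascade can be tried in isolation. The sketch below mirrors the order used by the script's `_cascade_decode` for 8-bit strings, but it is a simplified stand-alone version: the name `cascade_decode`, the `hint` parameter and the returned codec name are additions made here for illustration.

```python
from typing import Optional, Tuple


def cascade_decode(raw: bytes, hint: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, codec) for the first codec that decodes the bytes strictly."""
    order = ["utf-8"]  # strict UTF-8 goes first: legacy 8-bit text rarely passes it
    if hint:
        order.append(hint)  # codec suggested by the codepage property, if any
    order += ["cp1250", "cp1252", "gb2312", "big5"]
    for codec in order:
        try:
            return raw.decode(codec, errors="strict"), codec
        except (UnicodeDecodeError, LookupError):
            continue
    # latin-1 maps every byte to a character, so this last step cannot fail
    return raw.decode("latin-1"), "latin-1"


czech = "Příloha č. 3"
print(cascade_decode(czech.encode("utf-8")))
print(cascade_decode(czech.encode("cp1250")))
```

The cp1250 bytes are rejected by the strict UTF-8 step and then accepted by cp1250, which is the case the fallback was written for.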
**New document fields:**
| Field | Meaning |
|---|---|
| `parse_mode` | `normal` / `suppress_all` / `override:<enc>`: how the file was opened |
| `parse_degraded` | `true` = a fallback was needed (broken attachment or guessed encoding) |
**Verified:** all 126 files that failed in the 2026-06-08 run are now recovered cleanly (74× `suppress_all`, 52× `override:cp1250`), 0 empty, 0 containing `U+FFFD`.
Finding degraded documents:
```javascript
db["vbuzalka@its.jnj.com"].find({ parse_degraded: true })
```
---
## Performance
| Parameter | Value |
|---|---|
| Number of files | ~69,000 |
| Throughput | ~0.4 msg/s (htmlBody decoding) |
| Estimated time | ~48 hours |
| Batch size | 200 documents per `bulk_write` |
| Estimated DB size | 25 GB |
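The time estimate follows directly from the file count and the throughput. A sketch of the arithmetic, formatted the same way as the script's progress line; `format_eta` is a hypothetical helper name.

```python
def format_eta(remaining: int, rate: float) -> str:
    """Format the time needed for `remaining` messages at `rate` messages per second."""
    eta_s = int(remaining / rate) if rate > 0 else 0
    return f"{eta_s // 3600}h{(eta_s % 3600) // 60}m"


# 69,000 files at 0.4 msg/s is 172,500 s, i.e. just under 48 hours
print(format_eta(69_000, 0.4))
```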
---
## Dependencies (in the python-runner Docker image)
```
extract-msg==0.55.0
olefile
pymongo
python-dateutil
```
The image is built from the `Dockerfile` in `/mnt/user/Scripts/python-runner/`.
---
## Version history
| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-06-01 | Initial version |
| 1.1 | 2026-06-02 | Deployed to Unraid Tower in the python-runner Docker container; MSGS_DIR changed from the SMB share (`\\tower\JNJEMAILS`) to the local mount (`/mnt/JNJEMAILS`); run instructions updated for `docker exec` |
| 1.2 | 2026-06-08 | **`to_bson` fix:** an int outside the int64 range (BSON only supports 8-byte ints) is converted to a string. Previously the whole `bulk_write` failed with `MongoDB can only handle up to 8-byte ints` and dropped the entire batch of 200 documents (the v1.1 run on 2026-06-08 saved **nothing**). `flush()` has a per-document fallback (it drops only the bad record, not the whole batch). `bool()` is tested before `int()`. Separate logs `parse_emails_tower.log` + `parse_emails_tower_errors.log`. |
| 1.3 | 2026-06-08 | **Rescue of previously failing .msg files** (about 126 from the 2026-06-08 run): `open_message()` cascading open (`normal`→`SUPPRESS_ALL`→`+overrideEncoding`) handles both broken attachments and "not an MSG file"; the **raw-OLE fallback** reads subject/sender/body/html directly from the OLE streams with cascading decoding (utf-8 strict→CPID→cp1250…) when extract_msg returns empty fields or `U+FFFD`. New fields `parse_mode`, `parse_degraded`. New dependency `olefile`. Verified: 126/126 recovered cleanly. |
@@ -0,0 +1,896 @@
"""
parse_emails_tower_v1.3.py

Name:    parse_emails_tower_v1.3.py
Version: 1.3
Date:    2026-06-08
Author:  vladimir.buzalka

Description:
    Parses all .msg files in MSGS_DIR and imports them as documents
    into MongoDB. For each file it extracts ALL available properties,
    much like EXIF data for photos:
      - subject, sender, recipients (To/CC/BCC with types)
      - received and sent time (UTC)
      - plain-text + HTML body (max 2 MB)
      - attachments (metadata: name, size, MIME type, inline flag)
      - internet headers (X-Originating-IP, Received, DKIM, ...)
      - MAPI properties: importance, sensitivity, flag, conversation thread,
        categories, In-Reply-To, References, ...
      - all raw MAPI properties as {0xXXXX: value}

    DB:         emaily
    Collection: vbuzalka@its.jnj.com
    _id:        Internet Message-ID (or "filename:<stem>" as a fallback)

    Safe to interrupt and rerun:
      - upsert by _id, so duplicates are overwritten automatically
      - --skip-existing loads the list of finished files from MongoDB and
        skips them, so a run resumes after an interruption without losing work

Environment:
    Runs in the "python-runner" Docker container on Unraid Tower.
    The .msg files are available as a local disk (volume mount):
      /mnt/user/JNJEMAILS -> /mnt/JNJEMAILS (inside the container)
    MongoDB at 192.168.1.76:27017 (external, runs outside the container).
Running (from the Unraid terminal):
    # Test on 50 emails:
    docker exec -it python-runner python /scripts/parse_emails_tower_v1.3.py --limit 50 --no-indexes
    # Full import in the background (separate log, not shared with the Graph import):
    docker exec -d python-runner bash -c \
      "python /scripts/parse_emails_tower_v1.3.py > /scripts/parse_emails_tower.log 2>&1"
    # Resuming after an interruption:
    docker exec -d python-runner bash -c \
      "python /scripts/parse_emails_tower_v1.3.py --skip-existing > /scripts/parse_emails_tower.log 2>&1"
    # Watching progress:
    docker exec -it python-runner tail -f /scripts/parse_emails_tower.log

Console output:
    One line per email:
      <index>/<total>  OK/ERR  <subject, 60 chars>  <sender>
    Every 500 emails: a separator with progress, throughput and ETA.
    At the end: ok/skip/err summary, total time, number of documents in the collection.

Dependencies (installed in the python-runner Docker image):
    extract-msg==0.55.0, olefile, pymongo, python-dateutil
    Python 3.12, Linux (Docker container on Unraid Tower)
    (olefile is a transitive dependency of extract-msg; the raw-OLE fallback uses it directly)
MongoDB document structure:
    _id                         Internet Message-ID (or the filename: fallback)
    filename                    name of the .msg file (20-character hex + .msg)
    subject                     message subject
    normalized_subject          subject without the RE:/FW: prefix
    importance                  0=low 1=normal 2=high
    sensitivity                 0=normal 1=personal 2=private 3=confidential
    flag_status                 0=not flagged 1=completed 2=flagged (MAPI PidTagFlagStatus)
    read_receipt_requested      bool
    delivery_receipt_requested  bool
    has_attachments             bool
    attachment_count            int
    message_size_bytes          size of the .msg file on disk
    conversation_topic          thread topic (PR_CONVERSATION_TOPIC)
    conversation_index          base64 PR_CONVERSATION_INDEX
    in_reply_to                 Message-ID of the previous message
    internet_references         [Message-ID], the full thread history
    categories                  [str], MAPI categories / labels
    received_at                 datetime UTC, delivery time
    sent_at                     datetime UTC, send time
    sender.email                sender's email address
    sender.name                 sender's display name
    sender.smtp                 SMTP address (for internal EX addresses)
    to                          To string (as shown in Outlook)
    cc                          CC string
    bcc                         BCC string
    display_to                  PR_DISPLAY_TO (shortened list)
    display_cc                  PR_DISPLAY_CC
    recipients                  [{type, email, name}], to/cc/bcc with types
    body_text                   plain-text body
    body_html                   HTML body (max 2 MB, None if absent)
    attachments                 [{filename, size_bytes, mime_type,
                                  content_id, is_inline}]
    headers                     dict of internet headers (lowercase_with_underscores)
    mapi                        dict of all raw MAPI properties {0xXXXX: value}
    parse_mode                  normal / suppress_all / override:ENC (how the file was opened)
    parse_degraded              bool, True = a fallback was needed
    parsed_at                   datetime UTC, parse time
Indexes (created automatically at the end):
    received_at, sent_at, sender.email, filename (unique),
    conversation_topic, has_attachments, categories, importance,
    flag_status, text_search (subject + body_text + to + cc)

Errors:
    Files that failed are logged to parse_emails_tower_errors.log
    in the script directory (a SEPARATE log, kept apart from the Graph import).
    Line format: timestamp | open/extract failed | reason.
Version history:
    1.0  2026-06-01  Initial version
    1.1  2026-06-02  Deployed to Unraid Tower in the python-runner Docker container;
                     MSGS_DIR changed from the SMB share to the local mount /mnt/JNJEMAILS;
                     run instructions updated for docker exec
    1.2  2026-06-08  FIX: to_bson converts ints outside the int64 range to strings
                     (BSON only supports 8-byte ints). Previously the whole bulk_write
                     failed with 'MongoDB can only handle up to 8-byte ints' and dropped
                     the entire batch of 200 documents (the v1.1 run on 2026-06-08 saved
                     NOTHING).
                     flush() has a per-document fallback: it drops only the bad record,
                     not the whole batch. bool() is tested before int().
                     Separate error log parse_emails_tower_errors.log and stdout log
                     parse_emails_tower.log (previously shared with the Graph import,
                     which cluttered the log).
    1.3  2026-06-08  RESCUE of previously failing .msg files (about 126 from the 2026-06-08 run):
                     - open_message(): cascading open
                       normal -> SUPPRESS_ALL (broken attachments) -> +overrideEncoding
                       Handles both 'Attachment method missing' and 'not an MSG file'.
                     - raw-OLE fallback: when extract_msg returns empty fields or U+FFFD
                       (nested email, or a declared codepage 1200 that is really
                       cp1250/gb2312), the key fields (subject/sender/body/html) are read
                       DIRECTLY from the OLE streams with cascading decoding
                       (utf-8 strict -> CPID -> cp1250 ...).
                       Encoding headers are not trusted (they often contradict each other).
                     - new fields: parse_mode (normal/suppress_all/override:ENC),
                       parse_degraded (bool).
"""
import sys
import re
import logging
import argparse
import base64
import struct
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional
import extract_msg
from extract_msg.enums import ErrorBehavior
import olefile
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT
if hasattr(sys.stdout, "reconfigure"):
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
# ─── CONFIGURATION ────────────────────────────────────────────────────────────
MSGS_DIR = Path("/mnt/JNJEMAILS")
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "vbuzalka@its.jnj.com"
BATCH_SIZE = 200
LOG_FILE = Path(__file__).parent / "parse_emails_tower_errors.log"
SCRIPT_VERSION = "1.3"
# ──────────────────────────────────────────────────────────────────────────────
logging.basicConfig(
filename=str(LOG_FILE),
level=logging.ERROR,
format="%(asctime)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
encoding="utf-8",
)
# ─── Helper functions ─────────────────────────────────────────────────────────
def safe(obj, *attrs, default=None):
    """Read attributes defensively and return the first non-None, non-blank value."""
    for attr in attrs:
        try:
            val = getattr(obj, attr, None)
            if val is None:
                continue
            if isinstance(val, str) and not val.strip():
                continue
            return val
        except Exception:
            continue
    return default
def parse_date(raw) -> Optional[datetime]:
    """Convert any date value to a naive UTC datetime (the form stored in MongoDB)."""
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw  # naive input is passed through unchanged
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None
_INT64_MIN, _INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def to_bson(val):
    """Convert a value to a BSON-serializable type.

    Note: BSON only supports signed int64. Python integers are unbounded, so
    large MAPI values (PR_CHANGE_KEY, FILETIME, 64-bit handles) outside the
    int64 range are converted to strings; otherwise the whole bulk_write fails
    with 'MongoDB can only handle up to 8-byte ints'.
    """
    # bool must be checked BEFORE int (isinstance(True, int) is True)
    if isinstance(val, bool):
        return val
    if isinstance(val, bytes):
        return val.hex() if len(val) <= 128 else f"<bytes:{len(val)}>"
    if isinstance(val, datetime):
        return parse_date(val)
    if isinstance(val, int):
        return val if _INT64_MIN <= val <= _INT64_MAX else str(val)
    if isinstance(val, (str, float, type(None))):
        return val
    if isinstance(val, list):
        return [to_bson(v) for v in val]
    try:
        # enum-like values: store them as ints when they fit into int64
        iv = int(val)
        return iv if _INT64_MIN <= iv <= _INT64_MAX else str(iv)
    except Exception:
        pass
    return str(val)
# ─── Extrakce částí zprávy ────────────────────────────────────────────────────
def extract_headers(msg) -> dict:
headers = {}
try:
hdr = msg.header
if not hdr:
return {}
from email.header import decode_header as _dh
def _decode(v: str) -> str:
try:
parts = _dh(v)
out = ""
for part, enc in parts:
out += part.decode(enc or "utf-8", errors="replace") if isinstance(part, bytes) else part
return out
except Exception:
return v
for key in set(hdr.keys()):
k = key.lower().replace("-", "_")
vals = [_decode(v) for v in hdr.get_all(key, [])]
headers[k] = vals if len(vals) > 1 else (vals[0] if vals else "")
except Exception as e:
logging.error("extract_headers: %s", e)
return headers
def extract_recipients(msg) -> list:
result = []
type_map = {1: "to", 2: "cc", 3: "bcc"}
try:
for r in msg.recipients:
rtype = getattr(r, "type", 1)
try:
rtype = int(rtype)
except Exception:
try:
rtype = int(rtype.value)
except Exception:
rtype = 1
rec = {
"type": type_map.get(rtype, "to"),
"email": safe(r, "email", default=""),
"name": safe(r, "name", default=""),
}
result.append(rec)
except Exception as e:
logging.error("extract_recipients: %s", e)
return result
def extract_attachments(msg) -> list:
result = []
try:
for att in msg.attachments:
fname = safe(att, "longFilename", "shortFilename", default="")
if not fname:
continue
size = 0
try:
d = att.data
size = len(d) if d else 0
except Exception:
pass
result.append({
"filename": fname,
"size_bytes": size,
"mime_type": safe(att, "mimetype", "mimeType", default="application/octet-stream"),
"content_id": safe(att, "cid", default=None),
"is_inline": bool(safe(att, "isInline", default=False)),
})
except Exception as e:
logging.error("extract_attachments: %s", e)
return result
def extract_mapi_props(msg) -> dict:
"""Vsechny raw MAPI properties jako {0xXXXX: value}."""
result = {}
try:
props = msg.props
if not hasattr(props, "items"):
return {}
for key, prop in props.items():
try:
val = to_bson(prop.value)
prop_id = f"0x{key[:4].upper()}" if len(key) >= 4 else f"0x{key.upper()}"
result[prop_id] = val
except Exception:
pass
except Exception as e:
logging.error("extract_mapi_props: %s", e)
return result
# ─── Tolerant opening and raw-OLE fallback ───────────────────────────────────
#
# extract_msg cannot handle some .msg files: (a) a broken attachment without
# PR_ATTACH_METHOD, (b) a body that declares codepage 1200 (UTF-16) while the
# bytes are cp1250/gb2312, (c) a nested email ("not an MSG file"), for which
# extract_msg returns empty fields. The data is still in the file, so we open
# it tolerantly and read the degraded text fields DIRECTLY from the OLE streams
# with cascading decoding (declared encodings are not trusted).
# Windows codepage -> Python codec (PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE)
_CPID_TO_CODEC = {
1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257",
1258: "cp1258", 874: "cp874", 932: "shift_jis", 936: "gb2312",
949: "euc_kr", 950: "big5", 65001: "utf-8", 28591: "iso-8859-1",
28592: "iso-8859-2", 20127: "ascii",
}
def _read_u32_prop(ole, propid):
    """Read a 32-bit MAPI property value from the top-level __properties_version1.0."""
    try:
        data = ole.openstream("__properties_version1.0").read()
    except Exception:
        return None
    body = data[32:]  # the top-level property stream has a 32-byte header
    # fixed-size 16-byte records: tag (4 bytes), flags (4 bytes), value (8 bytes)
    for i in range(0, len(body) - 16 + 1, 16):
        rec = body[i:i + 16]
        tag = struct.unpack("<I", rec[0:4])[0]
        if ((tag >> 16) & 0xFFFF) == propid:
            return struct.unpack("<I", rec[8:12])[0]
    return None
def _detect_cpid(ole) -> Optional[str]:
    """Codec from PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE (a hint, not a guarantee)."""
    for pid in (0x3FDE, 0x3FFD):  # INTERNET_CPID, MESSAGE_CODEPAGE
        codec = _CPID_TO_CODEC.get(_read_u32_prop(ole, pid))
        # utf-8/ascii are poor hints for an 8-bit stream (the property often lies)
        if codec and codec not in ("utf-8", "ascii"):
            return codec
    return None
def _cascade_decode(raw: bytes, is_unicode: bool, cpid_codec: Optional[str]) -> str:
    """Decode the bytes of a MAPI string. Declared encodings are not trusted:
    codecs are tried strictly in priority order and the first success wins."""
    if not raw:
        return ""
    if is_unicode:  # PT_UNICODE = utf-16-le
        try:
            return raw.decode("utf-16-le")
        except Exception:
            return raw.decode("utf-16-le", errors="replace")
    order = ["utf-8"]  # strict utf-8 is a strong discriminator
    if cpid_codec:
        order.append(cpid_codec)
    order += ["cp1250", "cp1252", "gb2312", "big5"]
    for enc in order:
        try:
            return raw.decode(enc, errors="strict")
        except Exception:
            continue
    return raw.decode("latin-1", errors="replace")  # never fails
def _raw_mapi_strings(msg_path: Path) -> dict:
"""Cte klicova textova MAPI pole PRIMO z OLE (mimo extract_msg).
Pouzije se jen kdyz extract_msg vrati degradovane pole."""
out = {"subject": "", "normalized_subject": "", "sender_name": "",
"sender_email": "", "sender_smtp": "", "body_text": "", "body_html": ""}
try:
ole = olefile.OleFileIO(str(msg_path))
except Exception:
return out
try:
cpid = _detect_cpid(ole)
wanted = { # MAPI tag -> klic v out
"0037": "subject", "0E1D": "normalized_subject",
"0C1A": "sender_name", "5D01": "sender_smtp",
"0C1F": "sender_email", "1000": "body_text", "1013": "body_html",
}
prefix = "__substg1.0_"
found = {} # key -> (priorita_typu, hodnota)
for entry in ole.listdir():
if len(entry) != 1: # jen top-level (ne vnorene zpravy)
continue
name = entry[0]
if not name.startswith(prefix):
continue
tag = name[len(prefix):len(prefix) + 4].upper()
key = wanted.get(tag)
if not key:
continue
typ = name[-4:].upper()
prio = {"001F": 3, "001E": 2, "0102": 1}.get(typ, 0)
if prio == 0:
continue
prev = found.get(key)
if prev and prev[0] >= prio: # preferuj unicode > ansi > binarni
continue
try:
raw = ole.openstream(entry).read()
val = _cascade_decode(raw, typ == "001F", cpid)
except Exception:
continue
found[key] = (prio, val)
for key, (_, val) in found.items():
out[key] = val
finally:
ole.close()
return out
def _degraded(s) -> bool:
    """A field is degraded when it is empty or contains U+FFFD (the replacement character)."""
    return (not s) or ("\ufffd" in s)
def open_message(msg_path: Path):
    """Open a .msg file with a cascade of fallbacks -> (msg, mode) or (None, None).

    normal        regular path
    suppress_all  tolerant of broken attachments
    override:ENC  tolerant + encoding forced from the codepage property
    """
    try:
        return extract_msg.Message(str(msg_path)), "normal"
    except Exception:
        pass
    try:
        return extract_msg.Message(
            str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL), "suppress_all"
    except Exception:
        pass
    encs = []
    try:
        ole = olefile.OleFileIO(str(msg_path))
        try:
            c = _detect_cpid(ole)
        finally:
            ole.close()  # close the OLE file even if detection fails
        if c:
            encs.append(c)
    except Exception:
        pass
    for e in encs + ["cp1250", "cp1252"]:
        try:
            return extract_msg.Message(
                str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL,
                overrideEncoding=e), f"override:{e}"
        except Exception:
            continue
    return None, None
# ─── Hlavní extrakce ─────────────────────────────────────────────────────────
def extract_message(msg_path: Path) -> Optional[dict]:
"""Parsuje jeden .msg soubor -> MongoDB dokument."""
msg, parse_mode = open_message(msg_path)
if msg is None:
logging.error("open failed [%s]: vsechny pokusy o otevreni selhaly", msg_path.name)
return None
try:
# ── Message-ID ────────────────────────────────────────────────
mid = None
for attr in ("messageId", "message_id", "internetMessageId"):
mid = safe(msg, attr)
if mid:
break
if not mid:
mid = f"filename:{msg_path.stem}"
mid = str(mid).strip()
# ── Předmět ───────────────────────────────────────────────────
try:
subject = msg.subject or ""
except Exception:
subject = ""
normalized_subject = safe(msg, "normalizedSubject", "normalized_subject", default="")
# ── Tělo ──────────────────────────────────────────────────────
try:
body_text = msg.body or ""
except Exception:
body_text = ""
body_html = None
try:
bh = msg.htmlBody
if isinstance(bh, bytes):
bh = bh.decode("utf-8", errors="replace")
if bh:
body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
except Exception:
pass
# ── Odesílatel ────────────────────────────────────────────────
try:
sender_email = msg.sender or ""
except Exception:
sender_email = ""
sender_name = safe(msg, "senderName", "sender_name", default="")
sender_smtp = safe(msg, "senderSmtpAddress", "sent_representing_smtp_address", default="")
# ── Příjemci ──────────────────────────────────────────────────
recipients = extract_recipients(msg)
try:
to_raw = msg.to or ""
except Exception:
to_raw = ""
try:
cc_raw = msg.cc or ""
except Exception:
cc_raw = ""
try:
bcc_raw = getattr(msg, "bcc", None) or ""
except Exception:
bcc_raw = ""
display_to = safe(msg, "displayTo", "display_to", default="")
display_cc = safe(msg, "displayCc", "display_cc", default="")
# ── Časy ──────────────────────────────────────────────────────
try:
received_at = parse_date(msg.date)
except Exception:
received_at = None
sent_at = None
for attr in ("clientSubmitTime", "client_submit_time", "sentOn"):
v = safe(msg, attr)
if v:
sent_at = parse_date(v)
break
# ── MAPI vlastnosti ───────────────────────────────────────────
importance = 1
try:
v = msg.importance
if v is not None:
importance = int(v)
except Exception:
pass
sensitivity = 0
try:
v = getattr(msg, "sensitivity", None)
if v is not None:
sensitivity = int(v)
except Exception:
pass
flag_status = 0
try:
v = safe(msg, "flagStatus", "flag_status")
if v is not None:
flag_status = int(v)
except Exception:
pass
conversation_topic = safe(msg, "conversationTopic", "conversation_topic", default="")
conversation_index = ""
try:
ci = safe(msg, "conversationIndex", "conversation_index")
if isinstance(ci, bytes):
conversation_index = base64.b64encode(ci).decode()
elif ci:
conversation_index = str(ci)
except Exception:
pass
in_reply_to = safe(msg, "inReplyTo", "in_reply_to", default="")
internet_refs = []
try:
refs = safe(msg, "internetReferences", "internet_references")
if isinstance(refs, list):
internet_refs = refs
elif isinstance(refs, str) and refs:
internet_refs = [r.strip() for r in refs.split() if r.strip()]
except Exception:
pass
categories = []
try:
cats = safe(msg, "categories")
if isinstance(cats, list):
categories = [str(c) for c in cats if c]
elif isinstance(cats, str) and cats:
categories = [c.strip() for c in re.split(r"[;,]", cats) if c.strip()]
except Exception:
pass
read_receipt = bool(safe(msg, "readReceiptRequested", "read_receipt_requested", default=False))
delivery_receipt = bool(safe(msg, "deliveryReceiptRequested", "delivery_receipt_requested", default=False))
# ── Internet headers ──────────────────────────────────────────
headers = extract_headers(msg)
if not in_reply_to:
in_reply_to = headers.get("in_reply_to", "")
if not internet_refs:
refs_str = headers.get("references", "")
if isinstance(refs_str, str) and refs_str:
internet_refs = [r.strip() for r in refs_str.split() if r.strip()]
# ── Přílohy ───────────────────────────────────────────────────
attachments = extract_attachments(msg)
# ── Raw MAPI ──────────────────────────────────────────────────
mapi_raw = extract_mapi_props(msg)
msg.close()
# ── Raw-OLE fallback for degraded text fields ─────────────────
# When extract_msg returned empty fields or U+FFFD, or had to guess the
# encoding (override/suppress), read the key fields directly from the OLE
# streams with cascading decoding, which is more reliable than one forced encoding.
parse_degraded = parse_mode != "normal"
# in non-normal mode the encoding was guessed, so the raw cascade is trusted more
forced = parse_mode != "normal"
if (forced or _degraded(subject) or _degraded(body_text)
or _degraded(sender_email) or (body_html and "\ufffd" in body_html)):
raw = _raw_mapi_strings(msg_path)
if raw["subject"] and (forced or _degraded(subject)):
subject = raw["subject"]
if raw["normalized_subject"] and (forced or _degraded(normalized_subject)):
normalized_subject = raw["normalized_subject"]
if raw["body_text"] and (forced or _degraded(body_text)):
body_text = raw["body_text"]
if raw["body_html"] and (forced or not body_html or "\ufffd" in body_html):
bh = raw["body_html"]
body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
if (raw["sender_smtp"] or raw["sender_email"]) and (forced or _degraded(sender_email)):
sender_email = raw["sender_smtp"] or raw["sender_email"]
if raw["sender_name"] and (forced or _degraded(sender_name)):
sender_name = raw["sender_name"]
if raw["sender_smtp"] and not sender_smtp:
sender_smtp = raw["sender_smtp"]
# ── Dokument ──────────────────────────────────────────────────
return {
"_id": mid,
"filename": msg_path.name,
"subject": subject,
"normalized_subject": normalized_subject,
"importance": importance,
"sensitivity": sensitivity,
"flag_status": flag_status,
"read_receipt_requested": read_receipt,
"delivery_receipt_requested": delivery_receipt,
"has_attachments": len(attachments) > 0,
"attachment_count": len(attachments),
"message_size_bytes": msg_path.stat().st_size,
"conversation_topic": conversation_topic,
"conversation_index": conversation_index,
"in_reply_to": in_reply_to,
"internet_references": internet_refs,
"categories": categories,
"received_at": received_at,
"sent_at": sent_at,
"sender": {
"email": sender_email,
"name": sender_name,
"smtp": sender_smtp,
},
"to": to_raw,
"cc": cc_raw,
"bcc": bcc_raw,
"display_to": display_to,
"display_cc": display_cc,
"recipients": recipients,
"body_text": body_text,
"body_html": body_html,
"attachments": attachments,
"headers": headers,
"mapi": mapi_raw,
"parse_mode": parse_mode, # normal / suppress_all / override:ENC
"parse_degraded": parse_degraded, # True = pouzit fallback (vadna priloha/encoding)
"parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
}
except Exception as e:
logging.error("extract_message failed [%s]: %s", msg_path.name, e)
return None
# ─── MongoDB indexy ───────────────────────────────────────────────────────────
def create_indexes(col):
print(" Vytvarim indexy...")
col.create_index([("received_at", ASCENDING)])
col.create_index([("sent_at", ASCENDING)])
col.create_index([("sender.email", ASCENDING)])
col.create_index([("filename", ASCENDING)], unique=True, sparse=True)
col.create_index([("conversation_topic", ASCENDING)])
col.create_index([("has_attachments", ASCENDING)])
col.create_index([("categories", ASCENDING)])
col.create_index([("importance", ASCENDING)])
col.create_index([("flag_status", ASCENDING)])
col.create_index([
("subject", TEXT),
("body_text", TEXT),
("to", TEXT),
("cc", TEXT),
], name="text_search", default_language="none")
print(" Indexy hotovy.")
# ─── MAIN ─────────────────────────────────────────────────────────────────────
def main():
    ap = argparse.ArgumentParser(description=f"parse_emails v{SCRIPT_VERSION}")
    ap.add_argument("--msgs-dir", default=str(MSGS_DIR),
                    help="Path to the .msg files")
    ap.add_argument("--limit", type=int, default=0,
                    help="Process at most N files (0 = all)")
    ap.add_argument("--skip-existing", action="store_true",
                    help="Skip files that are already in MongoDB (resume)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Do not create indexes at the end")
    args = ap.parse_args()
    msgs_dir = Path(args.msgs_dir)
    start = datetime.now()
    print(f"=== parse_emails v{SCRIPT_VERSION} ===")
    print(f"Start: {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Source: {msgs_dir}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")

    # MongoDB
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print(" MongoDB OK")
    except Exception as e:
        print(f" ERROR: MongoDB is not reachable -- {e}")
        sys.exit(1)
    col = client[MONGO_DB][MONGO_COL]

    # Skip existing: load the filenames that were already imported
    existing: set = set()
    if args.skip_existing:
        print(" Loading existing records from MongoDB...")
        existing = set(col.distinct("filename"))
        print(f" {len(existing)} already imported")

    # Scan
    print(f"\nScanning {msgs_dir} ...")
    all_files = sorted(msgs_dir.glob("*.msg"))
    if args.limit:
        all_files = all_files[:args.limit]
    to_process = [f for f in all_files if f.name not in existing]
    skipped = len(all_files) - len(to_process)
    total = len(to_process)
    print(f" Total .msg: {len(all_files)}")
    print(f" Skipped: {skipped}")
    print(f" To process: {total}\n")
    if total == 0:
        print("Nothing to import.")
        client.close()
        return

    batch = []
    ok_count = 0
    err_count = 0

    def flush():
        nonlocal ok_count, err_count
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            # The whole batch failed (typically one bad document). Retry it
            # document by document so that the error drops only the record
            # that is actually bad, not the entire BATCH_SIZE.
            logging.error("bulk_write failed (%s) -- falling back to per-document writes", e)
            print(f" ERROR bulk_write: {e} -- retrying per document")
            for op in batch:
                try:
                    col.bulk_write([op], ordered=False)
                except Exception as e2:
                    try:
                        bad_id = getattr(op, "_filter", {}).get("_id", "?")
                    except Exception:
                        bad_id = "?"
                    logging.error("per-document write failed [_id=%s]: %s", bad_id, e2)
                    print(f" DROPPED _id={bad_id}: {e2}")
                    ok_count -= 1
                    err_count += 1
        batch.clear()

    for i, msg_path in enumerate(to_process, 1):
        doc = extract_message(msg_path)
        if doc is None:
            err_count += 1
        else:
            batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
            ok_count += 1
        if len(batch) >= BATCH_SIZE:
            flush()
        # Print one line per e-mail
        status = "ERR " if doc is None else "OK "
        subject_str = (doc.get("subject") or "")[:60] if doc else "?"
        sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
        print(f" {i:>6}/{total} {status} {subject_str:<60} {sender_str}")
        if i % 500 == 0:
            elapsed = (datetime.now() - start).total_seconds()
            rate = i / elapsed if elapsed > 0 else 0
            eta_s = int((total - i) / rate) if rate > 0 else 0
            print(f" {'─'*80}")
            print(f" Progress: ok={ok_count} err={err_count} "
                  f"{rate:.1f} msg/s ETA {eta_s//3600}h{(eta_s%3600)//60}m")
            print(f" {'─'*80}")
    flush()

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Result: ok={ok_count} | skip={skipped} | err={err_count}")
    print(f"Total time: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Documents in collection: {col.count_documents({})}")
    if not args.no_indexes:
        print()
        create_indexes(col)
    print(f"\nEnd: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Errors logged to: {LOG_FILE}")
    client.close()


if __name__ == "__main__":
    main()
@@ -0,0 +1,80 @@
# jnj_tower_ingest v1.1.0
**File:** `jnj_tower_ingest_v1.1.py`
**Date:** 2026-06-10
**Author:** vladimir.buzalka
**Runs in:** Docker container `python-runner` on Unraid Tower (192.168.1.76), next to MongoDB.
## What it is
A unified **Tower-side ingest** of JNJ e-mails that combines three formerly separate parts in a single run:
| Phase | Formerly standalone | What it does |
|---|---|---|
| **1. PARSE** | `parse_emails_tower_v1.3.py` | `.msg` from `/mnt/JNJEMAILS` → document in Mongo `emaily."vbuzalka@its.jnj.com"` (body, attachments, headers, MAPI). Incremental via an **mtime watermark** (`jnj_sync_state`/`_id="parse_state"`). |
| **2. SYNC** | `sync_jnj_state_v1.0.py` | newest SQLite (read-only) → mirror `jnj_messages` + fills `jnj_folder`/state into `emaily`. Watermark `updated_at` + `last_db` shortcut. |
| **3. ENRICH** | `jnj_emails_to_fulltext_v1.0.py` | indexes the JNJ mailbox into **PG full-text** by calling the **shared** `5_enrich_fulltext_emails_vX.Y.py --mailbox vbuzalka@its.jnj.com` (same extractor as the Graph pipeline → consistent schema). |
**Order: parse → sync → enrich.** A freshly parsed mail gets its body (parse), its folder
path (sync) and its full-text entry (enrich) in a single run. The key everywhere is the Internet Message-ID, which is also the Mongo `_id`.
## Incrementality (cron every 5 min)
- **PARSE**: only `.msg` files with `mtime > parse_state.last_parse_mtime`. The first run seeds from
  the filenames already in Mongo; after that it is purely mtime-based. `--full` reparses everything. Indexes are built only on full/seed/`--reindex`.
- **SYNC**: watermark `updated_at` + `last_db` shortcut (same SQLite → no-op).
- **ENRICH**: runs **only when parse added new documents** (otherwise it is skipped, since JNJ
  is enriched by the main Graph pipeline at 6:00/18:00 anyway). The enrich version is **auto-detected**
  (newest `/scripts/5_enrich_fulltext_emails_v*.py`). `--no-enrich` disables it,
  `--enrich-always` forces it.
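
The mtime watermark used by the PARSE phase can be sketched as follows. This is a minimal illustration, not the code from `jnj_tower_ingest_v1.1.py`: the function name `select_new_msgs` and its signature are hypothetical, and persisting the returned watermark (for example in `jnj_sync_state`) is left out.

```python
from pathlib import Path
from typing import List, Tuple


def select_new_msgs(msgs_dir: Path, last_mtime: float) -> Tuple[List[Path], float]:
    """Return the .msg files modified after `last_mtime` and the new watermark.

    Hypothetical helper: the real script may store and compare the
    watermark differently.
    """
    fresh: List[Path] = []
    new_watermark = last_mtime
    for path in sorted(msgs_dir.glob("*.msg")):
        mtime = path.stat().st_mtime
        # Strictly greater: files at or below the watermark were already parsed.
        if mtime > last_mtime:
            fresh.append(path)
            new_watermark = max(new_watermark, mtime)
    return fresh, new_watermark
```

Calling it again with the returned watermark yields an empty list as long as no file changed, which is what keeps a 5-minute cron run cheap.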
Three independent events (a new `.msg` / a new `.db` / new documents for PG) → the script does only
the work that is pending; otherwise the run is a cheap no-op.
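
A minimal sketch of the phase ordering and the enrich gate, assuming each phase is a plain callable and that parse reports how many new documents it added. All names (`run_ingest`, `parse`, `sync`, `enrich`) are hypothetical, and letting `no_enrich` win over `enrich_always` is an assumption, not documented behaviour.

```python
from typing import Callable, List


def run_ingest(
    parse: Callable[[], int],
    sync: Callable[[], None],
    enrich: Callable[[], None],
    no_enrich: bool = False,
    enrich_always: bool = False,
) -> List[str]:
    """Run parse -> sync -> enrich and return the names of the phases that ran.

    `parse` is assumed to return the number of newly parsed documents.
    """
    ran: List[str] = []
    new_docs = parse()
    ran.append("parse")
    sync()
    ran.append("sync")
    # Enrich is skipped unless parse added documents or the run forces it.
    # Assumption: an explicit no_enrich takes precedence over enrich_always.
    if not no_enrich and (enrich_always or new_docs > 0):
        enrich()
        ran.append("enrich")
    return ran
```

With no new documents the call returns `["parse", "sync"]`, which corresponds to the cheap no-op case for the enrich phase.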
## Relation to the Graph pipeline
The main `0_run_pipeline` (Graph API) processes the buzalka.cz mailboxes and **skips JNJ**
(`SKIP_MAILBOXES`, no API). JNJ is handled by this script via `.msg` files. Both paths end up in the same
Mongo `emaily` and, through the **shared `5_enrich`**, in the same PG `MongoEmaily.emails`. The service
collections `jnj_messages` + `jnj_sync_state` are listed in the enrich `NON_MAILBOX_COLLECTIONS`
(they are not mailboxes → they do not go to PG).
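
The shared enrich script carries its version in the filename (`5_enrich_fulltext_emails_vX.Y.py`), so picking the newest one can be sketched as below. This is an illustration under stated assumptions: `newest_enrich_script` is a hypothetical name, and "newest" is taken to mean the highest `X.Y` compared numerically rather than the latest file modification time.

```python
import re
from pathlib import Path
from typing import Optional

# Matches e.g. "5_enrich_fulltext_emails_v1.10.py" and captures (1, 10).
_VERSION_RE = re.compile(r"5_enrich_fulltext_emails_v(\d+)\.(\d+)\.py")


def newest_enrich_script(scripts_dir: Path) -> Optional[Path]:
    """Return the enrich script with the highest X.Y version, or None."""
    best: Optional[Path] = None
    best_version = (-1, -1)
    for path in scripts_dir.glob("5_enrich_fulltext_emails_v*.py"):
        match = _VERSION_RE.fullmatch(path.name)
        if match is None:
            continue  # ignore names that do not follow the vX.Y pattern
        version = (int(match.group(1)), int(match.group(2)))
        # Tuple comparison is numeric, so v1.10 ranks above v1.9.
        if version > best_version:
            best, best_version = path, version
    return best
```

Comparing integer tuples avoids the pitfall of a plain string sort, where `v1.10` would sort before `v1.9`.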
## Arguments
| Argument | Meaning |
|---|---|
| `--dry-run` | writes nothing, only prints the plan for all phases |
| `--full` | parse: reparse everything; sync: ignore the watermark; enrich: force |
| `--limit N` | at most N files (parse) / rows (sync) |
| `--reindex` | forces index creation after parse |
| `--force` | sync: ignore `last_db` |
| `--parse-only` / `--sync-only` / `--enrich-only` | run only the given phase |
| `--no-enrich` | skip enrich |
| `--enrich-always` | run enrich even without new documents |
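
For orientation, the flags in the table could be declared with `argparse` roughly as follows. This is a sketch, not the script's actual parser: `build_parser` is a hypothetical name, the default `--limit 0` meaning "no limit" is an assumption, and so is treating the three `--*-only` flags as mutually exclusive.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (illustrative only)."""
    ap = argparse.ArgumentParser(prog="jnj_tower_ingest")
    ap.add_argument("--dry-run", action="store_true",
                    help="write nothing, only print the plan for all phases")
    ap.add_argument("--full", action="store_true",
                    help="reparse everything, ignore the sync watermark, force enrich")
    # Assumption: 0 means "no limit".
    ap.add_argument("--limit", type=int, default=0, metavar="N",
                    help="at most N files (parse) / rows (sync)")
    ap.add_argument("--reindex", action="store_true",
                    help="force index creation after parse")
    ap.add_argument("--force", action="store_true",
                    help="sync: ignore last_db")
    # Assumption: only one single-phase switch may be given at a time.
    only = ap.add_mutually_exclusive_group()
    only.add_argument("--parse-only", action="store_true")
    only.add_argument("--sync-only", action="store_true")
    only.add_argument("--enrich-only", action="store_true")
    ap.add_argument("--no-enrich", action="store_true", help="skip enrich")
    ap.add_argument("--enrich-always", action="store_true",
                    help="run enrich even without new documents")
    return ap
```

`build_parser().parse_args(["--full", "--limit", "5"])` then exposes `args.full` and `args.limit`, with every other switch defaulting to `False`.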
## Running
```bash
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.1.py --dry-run
docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.1.py # cron
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.1.py --enrich-only
```
## Scheduling (DONE)
Unraid User Scripts job `jnj_state_sync` (cron `*/5 * * * *`): a wrapper with `flock` calls
`docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.1.py`. It logs only real
work/errors to `/mnt/user/Scripts/logs/jnj_tower_ingest.log`
(grep `Zapisuji|PARSE hotovo|SYNC hotovo|ENRICH hotovo|CHYBA|Traceback`).
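
The wrapper itself is a shell script and is not part of this commit. Purely as an illustration of the non-overlap guarantee that `flock` provides, the same idea can be expressed in Python on a POSIX system; the name `single_instance` and the lock path in the usage note are hypothetical.

```python
import fcntl
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def single_instance(lock_path: str) -> Iterator[bool]:
    """Yield True if the exclusive lock was acquired, False if it is held elsewhere."""
    with open(lock_path, "w") as fh:
        try:
            # Non-blocking exclusive lock: fail at once instead of queueing runs.
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
```

A caller would wrap the ingest in `with single_instance("/tmp/jnj_tower_ingest.lock") as acquired:` and return immediately when `acquired` is false, so a slow run is never overlapped by the next 5-minute tick.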
## Revert
`jnj_tower_ingest_v1.0.py` (without enrich) + `parse_emails_tower_v1.3.py` +
`sync_jnj_state_v1.0.py` remain in `/scripts/` as a fallback. Reverting = rewriting the wrapper
back. `jnj_emails_to_fulltext` was moved to Trash (replaced by phase 3).
## Version history
- **1.0.0** 2026-06-10: unification of parse + sync (mtime watermark, order parse→sync).
- **1.1.0** 2026-06-10: added the ENRICH phase (shared `5_enrich --mailbox`, version auto-detection,
  runs only when there are new documents). Replaces `jnj_emails_to_fulltext_v1.0`.
File diff suppressed because it is too large
@@ -0,0 +1,55 @@
# -*- coding: utf-8 -*-
# =============================================================================
# Name:        fix_email_podruhe_v1.0.py
# Version:     1.0
# Date:        2026-06-10
# Description: For sites in KROK 1 whose STATUS contains "Email odeslán podruhé",
#              replaces that text with "1. připomínka odeslaná" (the 2nd e-mail
#              was in fact the 1st reminder). After writing, run
#              classify_krok --apply (the sites move to KROK 2). Idempotent.
# Usage:       python fix_email_podruhe_v1.0.py            (dry-run)
#              python fix_email_podruhe_v1.0.py --apply    (writes)
# =============================================================================
import os
import sys

from pymongo import MongoClient

MONGO_URI = os.environ.get("MONGO_URI", "mongodb://192.168.1.76:27017")
OLD = "Email odeslán podruhé"
NEW = "1. připomínka odeslaná"


def main():
    apply = "--apply" in sys.argv
    client = MongoClient(MONGO_URI)
    col = client["feasibility"]["investigators"]
    docs = list(col.find(
        {"KROK": {"$regex": "^1"}, "STATUS": {"$regex": "odeslán podruhé"}},
        {"prijmeni": 1, "jmeno": 1, "STATUS": 1},
    ))
    print(f"Found {len(docs)} sites in KROK 1 with '{OLD}'.\n")
    n = 0
    for d in docs:
        status = d.get("STATUS", "") or ""
        new_status = status.replace(OLD, NEW)
        if new_status == status:
            # The query regex is looser than OLD, so a match may not contain OLD.
            print(f"[SKIP] {d.get('prijmeni')} {d.get('jmeno')}: text not found")
            continue
        print(f"[OK] {d.get('prijmeni')} {d.get('jmeno')}:")
        print(f" '{status.splitlines()[0]}' -> '{new_status.splitlines()[0]}'")
        if apply:
            res = col.update_one({"_id": d["_id"]}, {"$set": {"STATUS": new_status}})
            n += res.modified_count
    print()
    if apply:
        print(f">>> WRITTEN: {n} records. Now run classify_krok_v1.0.py --apply")
    else:
        print(">>> DRY-RUN. To write, run with --apply")


if __name__ == "__main__":
    main()
@@ -0,0 +1,39 @@
<!--
=============================================================================
Name:        sipiq_email_template_v1.0.html
Version:     1.0
Date:        2026-06-10
Description: Approved template of the SIPIQ feasibility e-mail (study 77242113UCO3002 / DAWN).
Used via MCP vbcz-email create_draft_eml.
Placeholders (replace before generating the draft):
{{LINK}} - the site's unique SIPIQ Qualtrics link (from the Trilium note "SIPIQ", noteId hAMNUnUQdCRn)
NOTE: inside <a href="..."> every & must be written as &amp;
{{DEADLINE}} - completion deadline, format DD-MON-YYYY (e.g. 17-JUN-2026); rule = send date + 7 days
Fixed create_draft_eml parameters:
to = physician's address (verify against real correspondence in the JNJ mailbox vbuzalka@its.jnj.com)
cc = AKocourk@ITS.JNJ.com, EBartoso@its.jnj.com
subject = 77242113UCO3002/Feasibility dotaznik
add_signature = false (the signature is part of the body below)
from_addr = default (vbuzalka@its.jnj.com; filled in automatically on the JNJ PC)
output_dir = u:\Dropbox\!!!Days\Downloads Z230\UploadToJNJ
filename = sipiq_<surname>_<DDMONYYYY>.eml
After sending -> write to Mongo feasibility.investigators (per _id):
KROK = "6 - SIPIQ odeslan"
sipiq.link, sipiq.link_token (the Q_DL part), sipiq.link_stored_at, sipiq.link_source="Trilium SIPIQ note"
STATUS prepend: "<DDMONYYYY>: SIPIQ odeslan (deadline {{DEADLINE}}; <address>)"
Special rule: Stepek -> send to BOTH of his e-mail addresses.
=============================================================================
-->
<p>Dobrý den,</p>
<p>ve společnosti Johnson &amp; Johnson posuzujeme centra zvažovaná pro studie rané fáze vývoje. Prvním krokem je vyplnění dotazníku SIPIQ (Site Interest Protocol Information Questionnaire), díky kterému lépe porozumíme postupům, zásadám a možnostem vašeho centra.</p>
<p>Níže najdete odkaz na dotazník SIPIQ specifický pro Vaše centrum. Vyplněný dotazník prosím odešlete do <b>{{DEADLINE}}</b>.</p>
<p>Odkaz: <a href="{{LINK}}">{{LINK}}</a></p>
<p>Moc prosím vyplňte formulář pečlivě, neuvádějte ani příliš optimistická, ani příliš pesimistická čísla. Na konci dotazníku jsou dotazy na etickou komisi — tyto s přehledem ignorujte, protože situace stran etické komise je nám jasná; vše se podává v rámci centralizovaného EU podání, jehož součástí je i centrální etická komise příslušné země.</p>
<p>Naopak nás velice zajímá dotaz ke konci, jak dlouho odhadujete, že bude trvat vyjednávání smlouvy — uveďte to prosím na základě svých zkušeností z předchozích studií.</p>
<p>Po vyplnění bude následovat hodnoticí návštěva v centru a finální rozhodnutí o výběru centra.</p>
<p>V případě dotazů se na nás neváhejte obrátit.</p>
<p>S pozdravem,</p>
<p>MUDr. Vladimír BUZALKA<br>ICON plc<br>Performing Local Trial Management Services for Janssen Cilag s.r.o.<br>Global Clinical Operations<br>Mobile: +420 775 735 276<br>Fax: +420 227 012 284<br>E-mail: vbuzalka@its.jnj.com, vladimir.buzalka@iconplc.com</p>
@@ -0,0 +1,649 @@
import os
import sys
import pandas as pd
from datetime import date
from pathlib import Path
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from common.mongo_writer import get_db
STUDIES = ["77242113UCO3001", "42847922MDD3003"]
BASE_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
OUTPUT_DIR = BASE_DIR / "output"
DATE_COLUMNS = {
    "Orig Exp Date", "Exp Date", "Rcv Date",
    "Date Asgn", "Disp Date", "Date Ret", "Destroyed", "Max Visit Date",
    "Visit Date", "Scheduled Date",
}
N_SHIP_COLS = 9  # number of shipment columns that precede the detail columns
# ── Loading data from MongoDB ─────────────────────────────────────────────────
INVENTORY_COLS = [
    ("site", "Site"),
    ("medication_id", "Med ID"),
    ("packaged_lot_no", "Lot No."),
    ("original_expiration_date", "Orig Exp Date"),
    ("expiration_date", "Exp Date"),
    ("received_date", "Rcv Date"),
    ("receipt_user", "Rcpt User"),
    ("subject_identifier", "Subject ID"),
    ("quantity_assigned", "Qty Asgn"),
    ("irt_transaction", "IRT Tx"),
    ("date_assigned", "Date Asgn"),
    ("assignment_user", "Asgn User"),
    ("dispensation_status", "Disp Status"),
    ("dispensing_date", "Disp Date"),
    ("quantity_dispensed", "Qty Disp"),
    ("dispensing_user", "Disp User"),
    ("quantity_returned", "Qty Ret"),
    ("date_returned", "Date Ret"),
    ("return_user", "Ret User"),
]


def load_inventory(study):
    db = get_db()
    inv = list(db.iwrs_inventory.find({"study": study}))
    destr = list(db.iwrs_destruction.find({"study": study}))
    # map medication_id -> first basket+date
    destr_map = {}
    for d in destr:
        mid = d.get("medication_id")
        if mid and mid not in destr_map:
            destr_map[mid] = (d.get("basket_id"), d.get("destruction_date"))
    records = []
    for doc in inv:
        row = {label: doc.get(key) for key, label in INVENTORY_COLS}
        b, dt = destr_map.get(doc.get("medication_id"), (None, None))
        row["Destroyed"] = dt
        row["Basket No."] = b
        records.append(row)
    df = pd.DataFrame(records)
    if df.empty:
        print(" Inventory: 0 kits")
        return df
    df = df.sort_values(["Site", "Rcv Date", "Med ID"], na_position="last").reset_index(drop=True)
    for col in DATE_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    print(f" Inventory: {len(df)} kits")
    return df
SHIP_COLS = [
    ("shipment_id", "Shipment ID"),
    ("status", "IRT Shipment Status"),
    ("type", "Type"),
    ("ship_from", "Shipment From"),
    ("ship_to_site", "Ship To:"),
    ("request_date", "Request Date"),
    ("received_date", "Received Date"),
    ("received_by", "Received by"),
    ("expected_arrival", "Expected Arrival"),
]
ITEM_COLS = [
    ("investigator", "Investigator"),
    ("medication_description", "Medication Description"),
    ("medication_id", "Medication ID"),
    ("packaged_lot_no", "Packaged Lot number"),
    ("expiration_date", "Expiration Date"),
    ("item_status", "Status"),
]


def load_shipments(study):
    db = get_db()
    ships = list(db.iwrs_shipments.find({"study": study}))
    items = list(db.iwrs_shipment_items.find({"study": study}))
    # index items by shipment_id
    items_by_ship = {}
    for it in items:
        items_by_ship.setdefault(it.get("shipment_id"), []).append(it)
    records = []
    for s in ships:
        base = {label: s.get(key) for key, label in SHIP_COLS}
        for it in items_by_ship.get(s.get("shipment_id"), []):
            row = dict(base)
            for key, label in ITEM_COLS:
                row[label] = it.get(key)
            records.append(row)
    df = pd.DataFrame(records)
    if df.empty:
        print(" Shipments: 0 shipments, 0 kits")
        return df
    df = df.sort_values(["Ship To:", "Shipment ID", "Medication ID"], na_position="last").reset_index(drop=True)
    for col in ("Request Date", "Received Date", "Expiration Date", "Expected Arrival"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    n_ship = df["Shipment ID"].nunique()
    print(f" Shipments: {n_ship} shipments, {len(df)} kits")
    return df
def load_visits(study):
    db = get_db()
    cur = db.iwrs_visits.find({
        "study": study,
        "visit_type": "Past",
        "irt_transaction_no": {"$ne": None},
    })
    rows = []
    for v in cur:
        rows.append({
            "Subject": v.get("subject"),
            "Visit Date": v.get("actual_date") or v.get("scheduled_date"),
            "Scheduled Date": v.get("scheduled_date"),
            "IRT Tx No": v.get("irt_transaction_no"),
            "Visit": v.get("irt_transaction_description"),
            "Medication": v.get("medication_assignment"),
            "medication_id": v.get("medication_id"),
            "quantity_assigned": v.get("quantity_assigned"),
        })
    df = pd.DataFrame(rows)
    if df.empty:
        print(" Visits: 0 rows")
        return df
    # GROUP BY subject/actual/scheduled/irt_no/desc/medication
    grouped = (
        df.groupby(["Subject", "Visit Date", "Scheduled Date", "IRT Tx No", "Visit", "Medication"],
                   dropna=False, as_index=False)
        .agg(**{
            "Med IDs": ("medication_id", lambda s: ", ".join(sorted([str(x) for x in s if pd.notna(x)]))),
            "Qty": ("quantity_assigned", "sum"),
        })
    )
    grouped = grouped.sort_values(["Subject", "Visit Date"]).reset_index(drop=True)
    for col in ("Visit Date", "Scheduled Date"):
        if col in grouped.columns:
            grouped[col] = pd.to_datetime(grouped[col], errors="coerce")
    if study == "77242113UCO3001":
        grouped["Visit"] = grouped["Visit"].replace("Subject Number Creation", "Screening")
    print(f" Visits: {len(grouped)} rows")
    return grouped
# ── Derived sheets ────────────────────────────────────────────────────────────
def build_site_summary(shipments_df):
    STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
    pivot = shipments_df.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
    for s in STATUS_COLS:
        if s not in pivot.columns:
            pivot[s] = 0
    pivot = (
        pivot[STATUS_COLS]
        .reset_index()
        .rename(columns={"Ship To:": "Site", "Returned by Subject": "Returned"})
        .sort_values("Site")
        .reset_index(drop=True)
    )
    pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)
    print(f" Site Summary: {len(pivot)} sites")
    return pivot
def build_expired(df):
    today = date.today()
    mask = (
        df["Basket No."].isna() &
        df["Subject ID"].isna() &
        (df["Exp Date"] < pd.Timestamp(today))
    )
    filtered = df[mask].copy().reset_index(drop=True)
    sheet_name = f"Expired as of {today.strftime('%d-%b-%Y')}"
    print(f" Expired: {len(filtered)}")
    return filtered, sheet_name


def build_assigned_not_dispensed(df):
    mask = df["Subject ID"].notna() & df["Disp Date"].isna()
    filtered = df[mask].copy().reset_index(drop=True)
    print(f" Assigned not dispensed: {len(filtered)}")
    return filtered


def build_not_returned(df):
    no_ret = df[
        df["Date Ret"].isna() &
        df["Subject ID"].notna() &
        (df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
    ].copy()
    max_asgn = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
    no_ret = no_ret.join(max_asgn, on="Subject ID")
    filtered = no_ret[no_ret["Date Asgn"] < no_ret["Max Visit Date"]].copy()
    filtered = filtered.drop(columns=["Qty Ret", "Date Ret", "Ret User", "Destroyed", "Basket No."])
    filtered = filtered.reset_index(drop=True)
    print(f" Not returned: {len(filtered)}")
    return filtered


def build_kits_for_destruction(df):
    mask = (
        df["Basket No."].isna() &
        (df["Date Ret"].notna() | (df["Disp Status"].fillna("").str.upper() == "NOT DISPENSED"))
    )
    filtered = (
        df[mask]
        .copy()
        .sort_values(["Site", "Date Ret"], ascending=[True, True])
        .drop(columns=["Destroyed", "Basket No."])
        .reset_index(drop=True)
    )
    print(f" Kits for destruction: {len(filtered)}")
    return filtered
# ── Formatting ────────────────────────────────────────────────────────────────
STRIPE_GRAY = PatternFill("solid", start_color="F2F2F2")
STRIPE_WHITE = PatternFill("solid", start_color="FFFFFF")
# patients: styles kept from create_subject_report.py
_PAT_HEADER_FILL = PatternFill("solid", start_color="1F4E79")
_PAT_HEADER_FONT = Font(name="Arial", bold=True, color="FFFFFF", size=10)
_PAT_NORMAL_FONT = Font(name="Arial", size=10)
_PAT_BOLD_FONT = Font(name="Arial", bold=True, size=10)
_PAT_STRIKE_FONT = Font(name="Arial", size=10, strike=True, color="999999")
_PAT_ADOLESC_FONT = Font(name="Arial", bold=True, size=10)
_PAT_THIN = Side(style="thin", color="CCCCCC")
_PAT_BORDER = Border(left=_PAT_THIN, right=_PAT_THIN, top=_PAT_THIN, bottom=_PAT_THIN)
_PAT_EVEN_FILL = PatternFill("solid", start_color="EBF3FB")
_PAT_ODD_FILL = PatternFill("solid", start_color="FFFFFF")
_PAT_CENTER = Alignment(horizontal="center", vertical="center")
_PAT_LEFT = Alignment(horizontal="left", vertical="center")
def _autofit(ws):
    for col_cells in ws.columns:
        max_len = 0
        col_letter = get_column_letter(col_cells[0].column)
        for cell in col_cells:
            if cell.value is None:
                continue
            # a date is displayed as DD-MMM-YYYY = 11 characters
            if hasattr(cell.value, "strftime") or cell.number_format == "DD-MMM-YYYY":
                length = 11
            else:
                length = len(str(cell.value))
            if length > max_len:
                max_len = length
        ws.column_dimensions[col_letter].width = min(max_len + 3, 50)
def format_sheet(ws, header_color, highlight_col=None, highlight_color=None):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    header_fill = PatternFill("solid", start_color=header_color)
    header_font = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    row_font = Font(name="Arial", size=10)
    hi_fill = PatternFill("solid", start_color=highlight_color) if highlight_color else None
    headers = [cell.value for cell in ws[1]]
    for cell in ws[1]:
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=False)
        cell.border = border
    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        stripe = STRIPE_GRAY if row[0].row % 2 == 0 else STRIPE_WHITE
        for cell in row:
            col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
            cell.font = row_font
            cell.border = border
            cell.alignment = Alignment(horizontal="center")
            if col_name in DATE_COLUMNS:
                cell.number_format = "DD-MMM-YYYY"
            if hi_fill and col_name == highlight_col:
                cell.fill = hi_fill
            else:
                cell.fill = stripe
    _autofit(ws)
    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"
def format_shipment_sheet(ws, header_color_ship, header_color_detail, n_ship_cols):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    hfont = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    dfont = Font(name="Arial", size=10)
    fill_ship = PatternFill("solid", start_color=header_color_ship)
    fill_detail = PatternFill("solid", start_color=header_color_detail)
    for cell in ws[1]:
        cell.fill = fill_ship if cell.column <= n_ship_cols else fill_detail
        cell.font = hfont
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        cell.border = border
    ws.row_dimensions[1].height = 30
    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        stripe = STRIPE_GRAY if row[0].row % 2 == 0 else STRIPE_WHITE
        for cell in row:
            cell.font = dfont
            cell.border = border
            cell.alignment = Alignment(horizontal="center", vertical="center")
            cell.fill = stripe
            # datetime and pandas Timestamp are subclasses of date
            if isinstance(cell.value, date):
                cell.number_format = "DD-MMM-YYYY"
    _autofit(ws)
    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"
# ── Patients ─────────────────────────────────────────────────────────────────
def load_patients(study):
    db = get_db()
    docs = list(db.iwrs_subject_summary.find({"study": study}))
    if not docs:
        raise RuntimeError(f"No patient data in Mongo for {study}")
    base_cols = [
        ("subject", "Subject"),
        ("investigator", "Investigator"),
        ("age", "Subject's age collection"),
        ("cohort_per_irt", "Cohort per IRT"),
        ("irt_subject_status", "IRT Subject Status"),
        ("last_irt_transaction", "Last Recorded IRT Transaction"),
        ("next_irt_transaction", "Next Expected IRT Transaction"),
        ("next_irt_transaction_date_local", "Next Expected IRT Transaction Date [Local]"),
    ]
    uco_extra = [
        ("rescreened_subject", "Rescreened Subject"),
        ("adt_ir", "ADT-IR"),
        ("three_or_more_advanced_therapies", "3+ Adv. Therapies"),
        ("only_oral_5asa_compounds", "Only 5-ASA"),
        ("ustekinumab", "Ustekinumab"),
        ("isolated_proctitis", "Isolated Proctitis"),
    ]
    cols = list(base_cols)
    if study == "77242113UCO3001":
        cols += uco_extra
    rows = [{label: d.get(key) for key, label in cols} for d in docs]
    df = pd.DataFrame(rows).sort_values("Subject").reset_index(drop=True)
    if "Next Expected IRT Transaction Date [Local]" in df.columns:
        df["Next Expected IRT Transaction Date [Local]"] = pd.to_datetime(
            df["Next Expected IRT Transaction Date [Local]"], errors="coerce"
        )
    print(f" Patients: {len(df)} subjects")
    return df
def _simplify_cohort(val):
    if pd.isna(val):
        return ""
    val = str(val)
    if "dolescent" in val:
        return "Adolescent"
    if val.startswith("Adult"):
        return "Adult"
    return val


def _fmt_date(val):
    if pd.isna(val):
        return ""
    if hasattr(val, "strftime"):
        return val.strftime("%Y-%m-%d")
    return str(val)[:10]
def _write_prehled(wb, df_raw, study):
    ws = wb.create_sheet("Přehled", 0)
    ws.sheet_view.showGridLines = False
    is_uco = (study == "77242113UCO3001")
    if is_uco:
        display_headers = ["Subject", "Investigator", "Věk", "Cohort",
                           "Rescreened", "ADT-IR", "≥3 Adv.Th.", "5-ASA only",
                           "Uste.", "Isol.Proct.",
                           "Status", "Last IRT", "Next Visit", "Next Date"]
        col_widths = [14, 22, 6, 12, 11, 8, 11, 10, 8, 12, 14, 12, 12, 13]
        status_col = 11
        flag_cols = set(range(5, 11))  # 1-indexed columns holding Yes/No values
    else:
        display_headers = ["Subject", "Investigator", "Věk", "Cohort", "Status", "Last IRT", "Next Visit", "Next Date"]
        col_widths = [14, 22, 6, 12, 14, 12, 12, 13]
        status_col = 5
        flag_cols = set()
    last_col = get_column_letter(len(display_headers))
    ws.merge_cells(f"A1:{last_col}1")
    title = ws["A1"]
    title.value = f"Subject Summary — {study} ({date.today().strftime('%d-%b-%Y')})"
    title.font = Font(name="Arial", bold=True, size=12, color="1F4E79")
    title.alignment = Alignment(horizontal="left", vertical="center")
    ws.row_dimensions[1].height = 22
    for c, (h, w) in enumerate(zip(display_headers, col_widths), 1):
        cell = ws.cell(row=2, column=c, value=h)
        cell.font = _PAT_HEADER_FONT
        cell.fill = _PAT_HEADER_FILL
        cell.alignment = _PAT_CENTER
        cell.border = _PAT_BORDER
        ws.column_dimensions[get_column_letter(c)].width = w
    ws.row_dimensions[2].height = 18
    base = {
        "Subject": df_raw["Subject"].fillna(""),
        "Investigator": df_raw["Investigator"].fillna(""),
        "Věk": df_raw["Subject's age collection"].apply(lambda v: "" if pd.isna(v) else int(v)),
        "Cohort": df_raw["Cohort per IRT"].apply(_simplify_cohort),
    }
    if is_uco:
        base.update({
            "Rescreened": df_raw["Rescreened Subject"].fillna(""),
            "ADT-IR": df_raw["ADT-IR"].fillna(""),
            "≥3 Adv.Th.": df_raw["3+ Adv. Therapies"].fillna(""),
            "5-ASA only": df_raw["Only 5-ASA"].fillna(""),
            "Uste.": df_raw["Ustekinumab"].fillna(""),
            "Isol.Proct.": df_raw["Isolated Proctitis"].fillna(""),
        })
    base.update({
        "Status": df_raw["IRT Subject Status"].fillna(""),
        "Last IRT": df_raw["Last Recorded IRT Transaction"].fillna(""),
        "Next Visit": df_raw["Next Expected IRT Transaction"].fillna(""),
        "Next Date": df_raw["Next Expected IRT Transaction Date [Local]"].apply(_fmt_date),
    })
    display = pd.DataFrame(base).sort_values("Subject").reset_index(drop=True)
    for r_idx, row in display.iterrows():
        excel_row = r_idx + 3
        status = str(row["Status"])
        is_failed = "Screen Failed" in status or "Discontinued" in status
        is_randomized = "Randomized" in status
        is_adolescent = row["Cohort"] == "Adolescent"
        fill = _PAT_EVEN_FILL if r_idx % 2 == 0 else _PAT_ODD_FILL
        for c_idx, val in enumerate(row, 1):
            cell = ws.cell(row=excel_row, column=c_idx, value=val if val != "" else None)
            cell.fill = fill
            cell.border = _PAT_BORDER
            cell.alignment = _PAT_CENTER if (c_idx == 3 or c_idx in flag_cols) else _PAT_LEFT
            if is_failed:
                cell.font = _PAT_STRIKE_FONT
            elif c_idx == status_col and is_randomized:
                cell.font = _PAT_BOLD_FONT
            elif c_idx == 4 and is_adolescent:
                cell.font = _PAT_ADOLESC_FONT
            else:
                cell.font = _PAT_NORMAL_FONT
        ws.row_dimensions[excel_row].height = 16
    ws.freeze_panes = "A3"
    ws.auto_filter.ref = f"A2:{last_col}{len(display) + 2}"
def _write_next_visits(wb, df_raw, study, visits_df=None):
    ws = wb.create_sheet("Next Visits", 1)
    ws.sheet_view.showGridLines = False
    ws.merge_cells("A1:D1")
    title = ws["A1"]
    title.value = f"Next Expected Visits — {study} ({date.today().strftime('%d-%b-%Y')})"
    title.font = Font(name="Arial", bold=True, size=12, color="1F4E79")
    title.alignment = Alignment(horizontal="left", vertical="center")
    ws.row_dimensions[1].height = 22
    nv_headers = ["Subject", "Investigator", "Next Visit", "Datum"]
    nv_widths = [14, 22, 26, 13]
    for c, (h, w) in enumerate(zip(nv_headers, nv_widths), 1):
        cell = ws.cell(row=2, column=c, value=h)
        cell.font = _PAT_HEADER_FONT
        cell.fill = _PAT_HEADER_FILL
        cell.alignment = _PAT_CENTER
        cell.border = _PAT_BORDER
        ws.column_dimensions[get_column_letter(c)].width = w
    ws.row_dimensions[2].height = 18
    df = pd.DataFrame({
        "Subject": df_raw["Subject"].fillna(""),
        "Investigator": df_raw["Investigator"].fillna(""),
        "Next Visit": df_raw["Next Expected IRT Transaction"].fillna(""),
        "Datum": df_raw["Next Expected IRT Transaction Date [Local]"],
        "Status": df_raw["IRT Subject Status"].fillna(""),
    })
    # I-0: date = screening date + 42 days
    if visits_df is not None and not visits_df.empty:
        screen = (
            visits_df[visits_df["Visit"].str.contains("Screen", case=False, na=False)]
            .groupby("Subject")["Visit Date"].min()
            .rename("Screening Date")
        )
        df = df.join(screen, on="Subject")
        mask_i0 = df["Next Visit"].str.contains("I-0", na=False)
        df.loc[mask_i0, "Datum"] = df.loc[mask_i0, "Screening Date"] + pd.Timedelta(days=42)
        df = df.drop(columns=["Screening Date"])
    df = df[df["Datum"].notna()]
    df = df[~df["Status"].str.contains("Screen Failed|Discontinued", na=False)]
    df = df.sort_values("Datum").reset_index(drop=True)
    for r_idx, row in df.iterrows():
        excel_row = r_idx + 3
        fill = _PAT_EVEN_FILL if r_idx % 2 == 0 else _PAT_ODD_FILL
        datum_val = row["Datum"]
        datum_str = datum_val.strftime("%Y-%m-%d") if hasattr(datum_val, "strftime") else str(datum_val)[:10]
        for c_idx, val in enumerate([row["Subject"], row["Investigator"], row["Next Visit"], datum_str], 1):
            cell = ws.cell(row=excel_row, column=c_idx, value=val if val != "" else None)
            cell.fill = fill
            cell.border = _PAT_BORDER
            cell.font = _PAT_NORMAL_FONT
            cell.alignment = _PAT_LEFT
        ws.row_dimensions[excel_row].height = 16
    ws.freeze_panes = "A3"
    ws.auto_filter.ref = f"A2:D{len(df) + 2}"
# ── Jeden report pro jednu studii ─────────────────────────────────────────────
def create_study_report(study):
today = date.today()
# číslování: najdi nejvyšší existující verzi pro dnešní datum
existing = sorted(OUTPUT_DIR.glob(f"{today} {study} CZ IWRS overview v*.xlsx"))
if existing:
last = existing[-1].stem # např. "2026-05-12 42847922MDD3003 CZ IWRS overview v3"
last_ver = int(last.rsplit("v", 1)[-1])
version = last_ver + 1
else:
version = 1
output_file = OUTPUT_DIR / f"{today} {study} CZ IWRS overview v{version}.xlsx"
print(f"\n[{study}] Nacitam z MongoDB...")
df = load_inventory(study)
shipments_df = load_shipments(study)
df_patients = load_patients(study)
visits_df = load_visits(study)
expired_df, expired_sheet = build_expired(df)
assigned_df = build_assigned_not_dispensed(df)
not_returned_df = build_not_returned(df)
destruction_df = build_kits_for_destruction(df)
site_summary_df = build_site_summary(shipments_df)
with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
df.to_excel( writer, index=False, sheet_name="CountryMedicationOverview")
expired_df.to_excel( writer, index=False, sheet_name=expired_sheet)
assigned_df.to_excel( writer, index=False, sheet_name="Assigned not dispensed")
not_returned_df.to_excel( writer, index=False, sheet_name="Not returned")
destruction_df.to_excel( writer, index=False, sheet_name="Kits for destruction")
shipments_df.to_excel( writer, index=False, sheet_name="Shipments")
site_summary_df.to_excel( writer, index=False, sheet_name="Site Summary")
visits_df.to_excel( writer, index=False, sheet_name="Patient Visits")
wb = load_workbook(output_file)
ws_main = wb["CountryMedicationOverview"]
format_sheet(ws_main, header_color="1F4E79")
green_fill = PatternFill("solid", start_color="E2EFDA")
headers_main = [c.value for c in ws_main[1]]
for row in ws_main.iter_rows(min_row=2, max_row=ws_main.max_row):
for cell in row:
col_name = headers_main[cell.column - 1] if cell.column <= len(headers_main) else None
if col_name in ("Destroyed", "Basket No."):
cell.fill = green_fill
format_sheet(wb[expired_sheet], header_color="C00000", highlight_col="Exp Date", highlight_color="FFE0E0")
format_sheet(wb["Assigned not dispensed"], header_color="833C00", highlight_col="Subject ID", highlight_color="FFF2CC")
format_sheet(wb["Not returned"], header_color="375623", highlight_col="Max Visit Date", highlight_color="E2EFDA")
format_sheet(wb["Kits for destruction"], header_color="595959")
format_shipment_sheet(wb["Shipments"], "1F4E79", "375623", N_SHIP_COLS)
format_sheet(wb["Site Summary"], header_color="1F4E79")
format_sheet(wb["Patient Visits"], header_color="1F4E79")
# ── pacienti (Přehled + Next Visits) na začátek ──────────────────────────
_write_prehled(wb, df_patients, study)
_write_next_visits(wb, df_patients, study, visits_df)
# ── sheet order: Patient Visits first ────────────────────────────────────
visits_ws = wb["Patient Visits"]
wb.move_sheet(visits_ws, offset=-wb.index(visits_ws))  # public API instead of assigning wb._sheets
wb.save(output_file)
print(f" Saved: {output_file.name} ({len(df)} rows)")
# ── Main ──────────────────────────────────────────────────────────────────────
def main():
OUTPUT_DIR.mkdir(exist_ok=True)
for study in STUDIES:
try:
create_study_report(study)
except Exception as e:
import traceback
print(f"\n[{study}] ERROR: {e}")
traceback.print_exc()
print("\nDone.")
if __name__ == "__main__":
    main()
@@ -0,0 +1,253 @@
"""
Import Drugs data (shipments, shipment_items, inventory, destruction) from XLSX into MongoDB.
Called from IWRS/Drugs/run_all.py after the reports have been downloaded.
"""
import os
import sys
import re
import glob
import pandas as pd
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from common.mongo_writer import (
to_str, to_int, to_date,
ensure_indexes, log_import,
bulk_upsert_with_snapshot, bulk_upsert_only,
)
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
# ── XLSX parsers (taken from run_all.py and adapted to Mongo documents) ──────
def parse_shipments_report(study):
path = os.path.join(BASE_DIR, f"xls_shipments_{study}", f"shipments_report_{study}.xlsx")
if not os.path.exists(path):
print(f" MISSING: {path}")
return []
raw = pd.read_excel(path, header=None)
header_row = None
for i, row in raw.iterrows():
if "Shipment ID" in [str(v).strip() for v in row]:
header_row = i
break
if header_row is None:
return []
df = pd.read_excel(path, header=header_row).dropna(how="all")
df = df[df["Location"].astype(str).str.contains("Czech", na=False, case=False)]
col = df.columns.tolist()
rows = []
for _, r in df.iterrows():
sid = to_str(r["Shipment ID"])
if not sid:
continue
rows.append({
"_id": sid,
"shipment_id": sid,
"study": study,
"status": to_str(r["IRT Shipment Status"]),
"type": to_str(r["Type"]),
"ship_from": to_str(r["Shipment From"]),
"ship_to_site": to_str(r["Ship To:"]),
"location": to_str(r["Location"]),
"request_date": to_date(r["Request Date"]),
"shipped_date": to_date(r["Shipped Date"]),
"received_date": to_date(r["Received Date"]) if "Received Date" in col else None,
"received_by": to_str(r["Received by"]) if "Received by" in col else None,
"delivered_date_utc": to_date(r["Delivered Date [UTC]"]) if "Delivered Date [UTC]" in col else None,
"delivery_recipient": to_str(r["Delivery Recipient"]) if "Delivery Recipient" in col else None,
"delivery_details": to_str(r["Delivery Details"]) if "Delivery Details" in col else None,
"cancelled_date": to_date(r["Cancelled Date"]) if "Cancelled Date" in col else None,
"total_medication_ids": to_int(r["Total Medication IDs"]) if "Total Medication IDs" in col else None,
"tracking_no": to_str(r["Tracking #"]) if "Tracking #" in col else None,
"shipping_category": to_str(r["Shipping Category"]) if "Shipping Category" in col else None,
"expected_arrival": to_date(r["Expected Arrival"]) if "Expected Arrival" in col else None,
})
return rows
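# The header-row scan used by the parsers can be sketched on its own. This is a
# minimal illustration, assuming the sheet is available as a list of rows (each a
# list of cell values); the helper name find_header_row is hypothetical and does
# not exist in this repository.

```python
from typing import Any, Optional, Sequence


def find_header_row(rows: Sequence[Sequence[Any]], marker: str) -> Optional[int]:
    """Return the index of the first row that contains `marker` as a cell, else None."""
    for index, row in enumerate(rows):
        # Compare stripped string forms so padded or non-string cells still match.
        if marker in (str(value).strip() for value in row):
            return index
    return None


sample = [
    ["Shipments Report", None, None],   # title line above the table
    [None, None, None],                 # blank spacer line
    ["Shipment ID ", "Location", "Type"],
    ["1001", "Czech Republic", "Initial"],
]

header_index = find_header_row(sample, "Shipment ID")
missing_index = find_header_row(sample, "Medication ID")
print(header_index, missing_index)
```

# Returning None for a missing marker lets the caller decide whether to skip the
# file or fail loudly.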
def parse_shipment_details(study):
detail_dir = os.path.join(BASE_DIR, f"xls_shipment_details_{study}")
files = sorted(glob.glob(os.path.join(detail_dir, "shipment_details_*.xlsx")))
rows = []
for path in files:
m = re.search(r"shipment_details_(.+)\.xlsx", os.path.basename(path))
shipment_id = m.group(1) if m else "UNKNOWN"
raw = pd.read_excel(path, header=None)
header_row = None
for i, row in raw.iterrows():
if "Medication ID" in [str(v).strip() for v in row]:
header_row = i
break
if header_row is None:
continue
df = pd.read_excel(path, header=header_row).dropna(how="all")
for _, r in df.iterrows():
med_desc = (to_str(r.get("Medication Description"))
or to_str(r.get("Medication ID Description")))
med_type = (to_str(r.get("Medication type"))
or to_str(r.get("Medication ID type")))
med_id = to_str(r.get("Medication ID"))
if not med_id:
continue
rows.append({
"_id": f"{shipment_id}:{med_id}",
"study": study,
"shipment_id": shipment_id,
"destination_location": to_str(r.get("Destination Location")),
"shipment_status": to_str(r.get("IRT Shipment Status")),
"shipment_type": to_str(r.get("Type")),
"destination_site": to_str(r.get("Destination Site")),
"investigator": to_str(r.get("Investigator")),
"medication_description": med_desc,
"medication_type": med_type,
"medication_id": med_id,
"packaged_lot_no": to_str(r.get("Packaged Lot number")),
"packaged_lot_description": to_str(r.get("Packaged Lot description")),
"container_id": to_str(r.get("Container ID")),
"quantity": to_int(r.get("Quantity of Medication IDs")),
"expiration_date": to_date(r.get("Expiration Date")),
"item_status": to_str(r.get("Status")),
})
# dedupe (last one wins)
by_id = {r["_id"]: r for r in rows}
return list(by_id.values())
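# The "last one wins" dedupe above relies on dict semantics: a repeated key keeps
# its original position but takes the most recent value. A small sketch with
# made-up rows; dedupe_last_wins is a hypothetical name used only for illustration.

```python
def dedupe_last_wins(rows):
    """Collapse rows that share an `_id`, keeping the last row seen for each key."""
    by_id = {}
    for row in rows:
        by_id[row["_id"]] = row  # a later duplicate replaces the earlier value
    return list(by_id.values())


rows = [
    {"_id": "S1:M1", "item_status": "In Transit"},
    {"_id": "S1:M2", "item_status": "Available"},
    {"_id": "S1:M1", "item_status": "Available"},  # duplicate key, newer status
]

deduped = dedupe_last_wins(rows)
print(deduped)
```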
def parse_inventory(study):
inv_dir = os.path.join(BASE_DIR, f"xls_reports_{study}")
files = sorted(glob.glob(os.path.join(inv_dir, "onsite_inventory_detail_*.xlsx")))
rows = []
for path in files:
raw = pd.read_excel(path, header=None)
site = investigator = location = None
header_row = None
for i, row in raw.iterrows():
first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
if first.startswith("Site:"):
site = first.replace("Site:", "").strip()
elif first.startswith("Investigator:"):
investigator = first.replace("Investigator:", "").strip()
elif first.startswith("Location:"):
location = first.replace("Location:", "").strip()
if first in ("Medication", "Medication ID") and header_row is None:
header_row = i
if header_row is None:
continue
df = pd.read_excel(path, header=header_row).dropna(how="all")
df = df.rename(columns={df.columns[0]: "medication_id"})
for _, r in df.iterrows():
med_id = to_str(r["medication_id"])
if not med_id or not site:
continue
rows.append({
"_id": f"{site}:{med_id}",
"study": study,
"site": site,
"investigator": investigator,
"location": location,
"medication_id": med_id,
"packaged_lot_no": to_str(r.get("Packaged Lot number")),
"original_expiration_date": to_date(r.get("Original Expiration Date when Packaged Lot was Added")),
"expiration_date": to_date(r.get("Expiration date")),
"received_date": to_date(r.get("Received Date")),
"receipt_user": to_str(r.get("Shipment Receipt User")),
"subject_identifier": to_str(r.get("Subject Identifier")),
"quantity_assigned": to_int(r.get("Quantity Assigned")),
"irt_transaction": to_str(r.get("IRT Transaction")),
"date_assigned": to_date(r.get("Date Assigned")),
"assignment_user": to_str(r.get("Assignment User")),
"dispensation_status": to_str(r.get("Dispensation Status")),
"dispensing_date": to_date(r.get("Dispensing date") or r.get("Dispensing Date")),
"quantity_dispensed": to_int(r.get("Quantity Dispensed")),
"dispensing_user": to_str(r.get("Dispensing User")),
"quantity_returned": to_int(r.get("Quantity Returned")),
"date_returned": to_date(r.get("Date Returned")),
"return_user": to_str(r.get("Return User")),
})
by_id = {r["_id"]: r for r in rows}
return list(by_id.values())
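# The report preamble ("Site:", "Investigator:", "Location:") is parsed from the
# first cell of each row. A dependency-free sketch of that step, assuming the
# first-column values are already extracted into a list; parse_preamble and
# PREFIXES are hypothetical names. It slices off the prefix instead of calling
# str.replace, so a value that happens to repeat the prefix text is left intact.

```python
PREFIXES = {
    "Site:": "site",
    "Investigator:": "investigator",
    "Location:": "location",
}


def parse_preamble(first_cells):
    """Collect 'Key: value' metadata lines from the first column of a report."""
    meta = {}
    for cell in first_cells:
        text = "" if cell is None else str(cell).strip()
        for prefix, key in PREFIXES.items():
            if text.startswith(prefix):
                meta[key] = text[len(prefix):].strip()
    return meta


first_column = [
    "Site: DD5-CZ10001",
    "Investigator: Jane Doe",   # made-up name for the example
    None,
    "Location: Czech Republic",
    "Medication ID",            # table header, not metadata
]

meta = parse_preamble(first_column)
print(meta)
```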
def parse_destruction_files(study):
dest_dir = os.path.join(BASE_DIR, f"xls_ip_destruction_{study}")
files = sorted(glob.glob(os.path.join(dest_dir, "ip_destruction_basket_*.xlsx")))
rows = []
for path in files:
raw = pd.read_excel(path, header=None)
meta = {}
header_row = None
for i, row in raw.iterrows():
first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
for key, attr in [
("Investigator Name:", "investigator"),
("Site ID:", "site_id"),
("Location:", "location"),
("Basket ID:", "basket_id"),
("Drug Destruction Created Date:", "destruction_date"),
]:
if first.startswith(key):
meta[attr] = first.replace(key, "").strip()
if first == "Medication ID Description" and header_row is None:
header_row = i
if header_row is None:
continue
df = pd.read_excel(path, header=header_row).dropna(how="all")
basket_id = meta.get("basket_id")
for _, r in df.iterrows():
med_id = to_str(r.get("Medication ID"))
if not med_id or not basket_id:
continue
rows.append({
"_id": f"{basket_id}:{med_id}",
"study": study,
"site_id": meta.get("site_id"),
"investigator": meta.get("investigator"),
"location": meta.get("location"),
"basket_id": basket_id,
"destruction_date": to_date(meta.get("destruction_date")),
"medication_description": to_str(r.get("Medication ID Description")),
"medication_id": med_id,
"packaged_lot_description": to_str(r.get("Packaged Lot description")),
"comments": to_str(r.get("Comments")),
})
by_id = {r["_id"]: r for r in rows}
return list(by_id.values())
# ── main import ──────────────────────────────────────────────────────────────
def import_study(study):
print(f"\n [{study}] parsing XLSX...")
shipments = parse_shipments_report(study)
items = parse_shipment_details(study)
inventory = parse_inventory(study)
destruct = parse_destruction_files(study)
print(f" Shipments: {len(shipments)} | Items: {len(items)} | Inventory: {len(inventory)} | Destruction: {len(destruct)}")
import_id = log_import(study, f"drugs_{study}", "drugs", {
"shipments": len(shipments),
"shipment_items": len(items),
"inventory": len(inventory),
"destruction": len(destruct),
})
print(f" import_id = {import_id}")
bulk_upsert_with_snapshot("iwrs_shipments", "iwrs_shipments_snapshots", shipments, import_id)
bulk_upsert_with_snapshot("iwrs_shipment_items", "iwrs_shipment_items_snapshots", items, import_id)
bulk_upsert_with_snapshot("iwrs_inventory", "iwrs_inventory_snapshots", inventory, import_id)
bulk_upsert_only("iwrs_destruction", destruct, import_id)
def run(studies):
ensure_indexes()
for s in studies:
import_study(s)
if __name__ == "__main__":
studies = sys.argv[1:] if len(sys.argv) > 1 else ["77242113UCO3001", "42847922MDD3003"]
run(studies)
@@ -0,0 +1,245 @@
"""
Complete pipeline for Drugs:
1. Onsite inventory detail (per site, always overwritten)
2. IP destruction (per basket, skips files that already exist)
3. Shipments report (one file per study, overwritten)
4. Shipment details (per CZ shipment, re-downloaded unless the file exists and the shipment is RECEIVED)
5. Import into MongoDB (studie.iwrs_shipments / iwrs_shipment_items / iwrs_inventory / iwrs_destruction)
Run this script; it processes both studies automatically.
"""
import os
import glob
import re
import datetime
import sys
import pandas as pd
from playwright.sync_api import sync_playwright
import import_to_mongo as drugs_mongo
BASE_URL = "https://janssen.4gclinical.com"
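# Credentials are read from the environment rather than committed to the repository.
# The variable names IWRS_EMAIL and IWRS_PASSWORD are an assumption; adapt them to your setup.
EMAIL = os.environ["IWRS_EMAIL"]
PASSWORD = os.environ["IWRS_PASSWORD"]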
STUDIES = ["77242113UCO3001", "42847922MDD3003"]
SITES = {
"77242113UCO3001": [
"DD5-CZ10001", "DD5-CZ10003", "DD5-CZ10006", "DD5-CZ10009",
"DD5-CZ10010", "DD5-CZ10012", "DD5-CZ10013", "DD5-CZ10015",
"DD5-CZ10016", "DD5-CZ10020", "DD5-CZ10021", "DD5-CZ10022",
],
"42847922MDD3003": [
"S10-CZ10002", "S10-CZ10004", "S10-CZ10005",
"S10-CZ10008", "S10-CZ10011", "S10-CZ10012",
],
}
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
# ── login ────────────────────────────────────────────────────────────────────
def login(page, study):
page.goto(BASE_URL)
page.wait_for_load_state("networkidle")
page.get_by_label("Email *").fill(EMAIL)
page.get_by_label("Password *").fill(PASSWORD)
page.locator("#login__submit").click()
page.wait_for_load_state("networkidle")
page.get_by_label("Study *").click()
page.get_by_role("option", name=study).click()
page.get_by_role("button", name="SELECT").click()
page.wait_for_load_state("networkidle")
# ── download functions ───────────────────────────────────────────────────────
def download_inventory(page, study):
out_dir = os.path.join(BASE_DIR, f"xls_reports_{study}")
os.makedirs(out_dir, exist_ok=True)
page.goto(f"{BASE_URL}/report/onsite_inventory_detail")
page.wait_for_load_state("networkidle", timeout=120000)
for site_id in SITES[study]:
print(f" [{site_id}] inventory...")
page.locator('input[placeholder="search"], input[type="text"]').first.click()
page.get_by_role("option", name=site_id).click()
page.wait_for_load_state("networkidle", timeout=120000)
filename = os.path.join(out_dir, f"onsite_inventory_detail_{site_id}.xlsx")
with page.expect_download(timeout=120000) as dl:
page.get_by_role("button", name="Download XLS").click()
dl.value.save_as(filename)
page.get_by_role("button", name="Clear").click()
page.wait_for_load_state("networkidle", timeout=120000)
print(f" Inventory OK ({len(SITES[study])} sites)")
def download_destruction(page, study):
out_dir = os.path.join(BASE_DIR, f"xls_ip_destruction_{study}")
os.makedirs(out_dir, exist_ok=True)
page.goto(f"{BASE_URL}/report/ip_destruction_form")
page.wait_for_load_state("networkidle", timeout=120000)
page.locator('input[placeholder="search"], input[type="text"]').first.click()
page.wait_for_timeout(1000)
baskets = [b.strip() for b in page.locator("mat-option").all_inner_texts()
if b.strip() and b.strip() != "No results found"]
page.keyboard.press("Escape")
page.wait_for_timeout(500)
if not baskets:
print(" No destruction baskets")
return
new_count = 0
for basket in baskets:
filename = os.path.join(out_dir, f"ip_destruction_basket_{basket}.xlsx")
if os.path.exists(filename):
continue # destruction records do not change, so skip
print(f" [basket {basket}] downloading...")
input_field = page.locator('input[placeholder="search"], input[type="text"]').first
input_field.click()
input_field.fill(basket)
page.wait_for_timeout(500)
page.locator("mat-option").first.dispatch_event("click")
page.wait_for_load_state("networkidle", timeout=120000)
with page.expect_download(timeout=120000) as dl:
page.get_by_role("button", name="Download XLS").click()
dl.value.save_as(filename)
new_count += 1
page.get_by_role("button", name="Clear").click()
page.wait_for_load_state("networkidle", timeout=120000)
print(f" Destruction OK ({new_count} new, {len(baskets) - new_count} skipped)")
def download_shipments_report(page, study):
out_dir = os.path.join(BASE_DIR, f"xls_shipments_{study}")
os.makedirs(out_dir, exist_ok=True)
page.goto(f"{BASE_URL}/report/shipments_report")
page.wait_for_load_state("networkidle", timeout=120000)
filename = os.path.join(out_dir, f"shipments_report_{study}.xlsx")
with page.expect_download(timeout=120000) as dl:
page.get_by_role("button", name="Download XLS").click()
dl.value.save_as(filename)
print(f" Shipments report OK")
def download_shipment_details(page, study):
out_dir = os.path.join(BASE_DIR, f"xls_shipment_details_{study}")
os.makedirs(out_dir, exist_ok=True)
# load the CZ shipment IDs from the shipments report that was just downloaded
report_path = os.path.join(BASE_DIR, f"xls_shipments_{study}", f"shipments_report_{study}.xlsx")
raw = pd.read_excel(report_path, header=None)
header_row = None
for i, row in raw.iterrows():
if "Shipment ID" in [str(v).strip() for v in row]:
header_row = i
break
if header_row is None:
    raise RuntimeError(f"'Shipment ID' header row not found in {report_path}")
df = pd.read_excel(report_path, header=header_row)
df = df.dropna(how="all")
df = df[df["Location"].astype(str).str.contains("Czech", na=False, case=False)]
cz_shipments = list(zip(
df["Shipment ID"].astype(str).str.strip(),
df["IRT Shipment Status"].astype(str).str.strip() if "IRT Shipment Status" in df.columns else [""] * len(df),
))
print(f" CZ shipments to download: {len(cz_shipments)}")
page.goto(f"{BASE_URL}/report/shipment_details_report")
page.wait_for_load_state("networkidle", timeout=120000)
skipped = 0
for shipment, status in cz_shipments:
filename = os.path.join(out_dir, f"shipment_details_{shipment}.xlsx")
if os.path.exists(filename) and status.upper() == "RECEIVED":
skipped += 1
continue # final state, the file no longer changes
input_field = page.locator('input[placeholder="search"], input[type="text"]').first
input_field.click()
input_field.fill(shipment)
page.wait_for_timeout(500)
page.locator("mat-option").first.dispatch_event("click")
page.wait_for_load_state("networkidle", timeout=120000)
with page.expect_download(timeout=120000) as dl:
page.get_by_role("button", name="Download XLS").click()
dl.value.save_as(filename)
print(f" [{shipment}] ({status}) OK")
page.get_by_role("button", name="Clear").click()
page.wait_for_load_state("networkidle", timeout=120000)
print(f" Skipped (RECEIVED): {skipped}")
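# The skip rule in download_shipment_details can be isolated as a small
# predicate: a detail file is fetched again unless it already exists AND the
# shipment has reached the RECEIVED state. A sketch under that reading;
# needs_download is a hypothetical helper, not part of this script.

```python
import os
import tempfile


def needs_download(filename: str, status: str) -> bool:
    """Return False only when the file exists and the shipment is RECEIVED."""
    already_final = os.path.exists(filename) and status.strip().upper() == "RECEIVED"
    return not already_final


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "shipment_details_1001.xlsx")
    missing = os.path.join(tmp, "shipment_details_1002.xlsx")
    open(existing, "w").close()  # simulate a previously downloaded file

    results = [
        needs_download(existing, "Received"),    # on disk and final: skip
        needs_download(existing, "In Transit"),  # on disk but may still change
        needs_download(missing, "RECEIVED"),     # final but never downloaded
    ]

print(results)
```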
# ── main ─────────────────────────────────────────────────────────────────────
def main():
os.chdir(BASE_DIR)
# ── Download ─────────────────────────────────────────────────────────────
with sync_playwright() as p:
for study in STUDIES:
print(f"\n{'='*60}")
print(f"[{study}] DOWNLOAD")
print(f"{'='*60}")
browser = p.chromium.launch(headless=False)
context = browser.new_context(accept_downloads=True)
page = context.new_page()
try:
print(" Logging in...")
login(page, study)
print("\n [1/4] Onsite inventory...")
download_inventory(page, study)
print("\n [2/4] IP destruction...")
download_destruction(page, study)
print("\n [3/4] Shipments report...")
download_shipments_report(page, study)
print("\n [4/4] Shipment details (CZ)...")
download_shipment_details(page, study)
except Exception as e:
import traceback
print(f" ERROR during download: {e}")
traceback.print_exc()
finally:
browser.close()
# ── Import into MongoDB ───────────────────────────────────────────────────
print(f"\n{'='*60}")
print("IMPORT INTO MongoDB")
print(f"{'='*60}")
try:
drugs_mongo.run(STUDIES)
except Exception as e:
import traceback
print(f" ERROR during import: {e}")
traceback.print_exc()
print(f"\n{'='*60}")
print("All done.")
print(f"{'='*60}")
if __name__ == "__main__":
    main()
@@ -0,0 +1,139 @@
import mysql.connector
import db_config
conn = mysql.connector.connect(
host=db_config.DB_HOST, port=db_config.DB_PORT,
user=db_config.DB_USER, password=db_config.DB_PASSWORD,
database=db_config.DB_NAME
)
c = conn.cursor()
# Add report_type to iwrs_import (if it does not exist yet)
try:
c.execute("""ALTER TABLE iwrs_import
ADD COLUMN report_type VARCHAR(20) NOT NULL DEFAULT 'patients'
AFTER source_file""")
print("ALTER TABLE iwrs_import OK: report_type added")
except mysql.connector.errors.DatabaseError as e:
if e.errno == 1060:  # ER_DUP_FIELDNAME (duplicate column name)
print("report_type already exists, skipped")
else:
raise
stmts = [
(
"iwrs_shipments",
"""CREATE TABLE IF NOT EXISTS iwrs_shipments (
id INT AUTO_INCREMENT PRIMARY KEY,
import_id INT NOT NULL,
study VARCHAR(20) NOT NULL,
shipment_id VARCHAR(20) NOT NULL,
status VARCHAR(50),
type VARCHAR(30),
ship_from VARCHAR(50),
ship_to_site VARCHAR(50),
location VARCHAR(50),
request_date DATE,
shipped_date DATE,
received_date DATE,
received_by VARCHAR(100),
delivered_date_utc DATE,
delivery_recipient VARCHAR(100),
delivery_details VARCHAR(200),
cancelled_date DATE,
total_medication_ids SMALLINT,
tracking_no VARCHAR(100),
shipping_category VARCHAR(50),
expected_arrival DATE,
FOREIGN KEY (import_id) REFERENCES iwrs_import(import_id),
INDEX idx_import (import_id),
INDEX idx_study_shipment (study, shipment_id)
)"""
),
(
"iwrs_shipment_items",
"""CREATE TABLE IF NOT EXISTS iwrs_shipment_items (
id INT AUTO_INCREMENT PRIMARY KEY,
import_id INT NOT NULL,
study VARCHAR(20) NOT NULL,
shipment_id VARCHAR(20) NOT NULL,
destination_location VARCHAR(50),
shipment_status VARCHAR(50),
shipment_type VARCHAR(30),
destination_site VARCHAR(50),
investigator VARCHAR(100),
medication_description VARCHAR(200),
medication_type VARCHAR(50),
medication_id VARCHAR(20),
packaged_lot_no VARCHAR(50),
packaged_lot_description VARCHAR(100),
container_id VARCHAR(50),
quantity SMALLINT,
expiration_date DATE,
item_status VARCHAR(50),
FOREIGN KEY (import_id) REFERENCES iwrs_import(import_id),
INDEX idx_import (import_id),
INDEX idx_med_id (medication_id)
)"""
),
(
"iwrs_inventory",
"""CREATE TABLE IF NOT EXISTS iwrs_inventory (
id INT AUTO_INCREMENT PRIMARY KEY,
import_id INT NOT NULL,
study VARCHAR(20) NOT NULL,
site VARCHAR(50),
investigator VARCHAR(100),
location VARCHAR(50),
medication_id VARCHAR(20),
packaged_lot_no VARCHAR(50),
original_expiration_date DATE,
expiration_date DATE,
received_date DATE,
receipt_user VARCHAR(100),
subject_identifier VARCHAR(20),
quantity_assigned SMALLINT,
irt_transaction VARCHAR(100),
date_assigned DATE,
assignment_user VARCHAR(100),
dispensation_status VARCHAR(50),
dispensing_date DATE,
quantity_dispensed SMALLINT,
dispensing_user VARCHAR(100),
quantity_returned SMALLINT,
date_returned DATE,
return_user VARCHAR(100),
FOREIGN KEY (import_id) REFERENCES iwrs_import(import_id),
INDEX idx_import (import_id),
INDEX idx_site (study, site)
)"""
),
(
"iwrs_destruction",
"""CREATE TABLE IF NOT EXISTS iwrs_destruction (
id INT AUTO_INCREMENT PRIMARY KEY,
study VARCHAR(20) NOT NULL,
site_id VARCHAR(50),
investigator VARCHAR(100),
location VARCHAR(50),
basket_id VARCHAR(20) NOT NULL,
destruction_date DATE,
medication_description VARCHAR(200),
medication_id VARCHAR(20),
packaged_lot_description VARCHAR(100),
comments VARCHAR(500),
imported_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
UNIQUE KEY uq_destruction (study, basket_id, medication_id),
INDEX idx_study_basket (study, basket_id)
)"""
),
]
for name, sql in stmts:
c.execute(sql)
print(f"OK: {name}")
conn.commit()
c.close()
conn.close()
print("\nAll tables are ready.")
@@ -0,0 +1,364 @@
import sys
import os
import mysql.connector
import pandas as pd
from datetime import date
from pathlib import Path
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
import db_config
STUDY = "42847922MDD3003"
# STUDY = "77242113UCO3001"
BASE_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
OUTPUT_DIR = BASE_DIR / "output"
OUTPUT_FILE = OUTPUT_DIR / f"{date.today().strftime('%Y-%m-%d')} {STUDY} CZ IWRS overview.xlsx"
DATE_COLUMNS = {
"Orig Exp Date", "Exp Date", "Rcv Date",
"Date Asgn", "Disp Date", "Date Ret", "Destroyed", "Max Visit Date",
}
COLUMN_WIDTHS = {
"Site": 14,
"Med ID": 10,
"Lot No.": 12,
"Orig Exp Date": 16,
"Exp Date": 14,
"Rcv Date": 14,
"Rcpt User": 22,
"Subject ID": 14,
"Qty Asgn": 9,
"IRT Tx": 8,
"Date Asgn": 14,
"Asgn User": 20,
"Disp Status": 16,
"Disp Date": 14,
"Qty Disp": 9,
"Disp User": 20,
"Qty Ret": 10,
"Date Ret": 14,
"Ret User": 18,
"Destroyed": 14,
"Basket No.": 12,
"Max Visit Date": 16,
}
# Shipments sheet: number of leading shipment-level columns; the item-detail columns start after them (1-based, used by format_shipment_sheet)
N_SHIP_COLS = 9
# ── DB ────────────────────────────────────────────────────────────────────────
def get_conn():
return mysql.connector.connect(
host=db_config.DB_HOST, port=db_config.DB_PORT,
user=db_config.DB_USER, password=db_config.DB_PASSWORD,
database=db_config.DB_NAME,
)
def get_latest_import_id(cursor, study):
cursor.execute(
"SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='drugs'",
(study,),
)
row = cursor.fetchone()
mid = row["mid"]
if mid is None:
raise RuntimeError(f"No data in MySQL for study {study}")
return mid
# ── Loading data from MySQL ───────────────────────────────────────────────────
def load_inventory(cursor, study, import_id):
"""
Returns a DataFrame of inventory rows joined with destruction data.
Columns are already renamed for the downstream functions.
"""
sql = """
SELECT
i.site AS Site,
i.medication_id AS `Med ID`,
i.packaged_lot_no AS `Lot No.`,
i.original_expiration_date AS `Orig Exp Date`,
i.expiration_date AS `Exp Date`,
i.received_date AS `Rcv Date`,
i.receipt_user AS `Rcpt User`,
i.subject_identifier AS `Subject ID`,
i.quantity_assigned AS `Qty Asgn`,
i.irt_transaction AS `IRT Tx`,
i.date_assigned AS `Date Asgn`,
i.assignment_user AS `Asgn User`,
i.dispensation_status AS `Disp Status`,
i.dispensing_date AS `Disp Date`,
i.quantity_dispensed AS `Qty Disp`,
i.dispensing_user AS `Disp User`,
i.quantity_returned AS `Qty Ret`,
i.date_returned AS `Date Ret`,
i.return_user AS `Ret User`,
d.destruction_date AS Destroyed,
d.basket_id AS `Basket No.`
FROM iwrs_inventory i
LEFT JOIN (
SELECT medication_id,
ANY_VALUE(basket_id) AS basket_id,
ANY_VALUE(destruction_date) AS destruction_date
FROM iwrs_destruction
WHERE study = %s
GROUP BY medication_id
) d ON d.medication_id = i.medication_id
WHERE i.import_id = %s
AND i.study = %s
ORDER BY i.site, i.received_date, i.medication_id
"""
cursor.execute(sql, (study, import_id, study))
rows = cursor.fetchall()
df = pd.DataFrame(rows)
for col in DATE_COLUMNS:
if col in df.columns:
df[col] = pd.to_datetime(df[col], errors="coerce")
print(f" Inventory: {len(df)} kits")
return df
def load_shipments(cursor, study, import_id):
"""
Returns a DataFrame of shipments joined with their items.
"""
sql = """
SELECT
s.shipment_id AS `Shipment ID`,
s.status AS `IRT Shipment Status`,
s.type AS Type,
s.ship_from AS `Shipment From`,
s.ship_to_site AS `Ship To:`,
s.request_date AS `Request Date`,
s.received_date AS `Received Date`,
s.received_by AS `Received by`,
s.expected_arrival AS `Expected Arrival`,
i.investigator AS Investigator,
i.medication_description AS `Medication Description`,
i.medication_id AS `Medication ID`,
i.packaged_lot_no AS `Packaged Lot number`,
i.expiration_date AS `Expiration Date`,
i.item_status AS Status
FROM iwrs_shipments s
JOIN iwrs_shipment_items i
ON i.study = s.study
AND i.shipment_id = s.shipment_id
AND i.import_id = %s
WHERE s.import_id = %s
AND s.study = %s
ORDER BY s.ship_to_site, s.shipment_id, i.medication_id
"""
cursor.execute(sql, (import_id, import_id, study))
rows = cursor.fetchall()
df = pd.DataFrame(rows)
for col in ("Request Date", "Received Date", "Expiration Date", "Expected Arrival"):
if col in df.columns:
df[col] = pd.to_datetime(df[col], errors="coerce")
print(f" Shipments: {df['Shipment ID'].nunique() if len(df) else 0} shipments, {len(df)} kits")
return df
# ── Derived sheets ────────────────────────────────────────────────────────────
def build_site_summary(shipments_df):
STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
pivot = shipments_df.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
for s in STATUS_COLS:
if s not in pivot.columns:
pivot[s] = 0
pivot = (
pivot[STATUS_COLS]
.reset_index()
.rename(columns={"Ship To:": "Site", "Returned by Subject": "Returned"})
.sort_values("Site")
.reset_index(drop=True)
)
pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)
print(f" Site Summary: {len(pivot)} sites")
return pivot
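# What the Site Summary pivot computes can be shown without pandas: count item
# statuses per site, keep only the four reported statuses, rename
# "Returned by Subject" to "Returned", and add a total over those four. This is
# an illustrative re-statement with made-up data, not the code the report uses;
# site_summary is a hypothetical name.

```python
from collections import Counter, defaultdict

STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]


def site_summary(items):
    """items: iterable of (site, status) pairs, one per shipped kit."""
    counts = defaultdict(Counter)
    for site, status in items:
        counts[site][status] += 1

    summary = []
    for site in sorted(counts):
        row = {"Site": site}
        for status in STATUS_COLS:
            label = "Returned" if status == "Returned by Subject" else status
            row[label] = counts[site][status]  # Counter yields 0 for unseen statuses
        # Statuses outside STATUS_COLS (e.g. "Damaged") are not part of the total.
        row["Total"] = sum(value for key, value in row.items() if key != "Site")
        summary.append(row)
    return summary


items = [
    ("S2", "Returned by Subject"),
    ("S1", "Available"),
    ("S1", "Dispensed"),
    ("S1", "Damaged"),
]

summary = site_summary(items)
print(summary)
```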
def build_expired(df):
today = date.today()
mask = (
df["Basket No."].isna() &
df["Subject ID"].isna() &
(df["Exp Date"] < pd.Timestamp(today))
)
filtered = df[mask].copy().reset_index(drop=True)
sheet_name = f"Expired as of {today.strftime('%d-%b-%Y')}"
print(f" Expired: {len(filtered)}")
return filtered, sheet_name
def build_assigned_not_dispensed(df):
mask = df["Subject ID"].notna() & df["Disp Date"].isna()
filtered = df[mask].copy().reset_index(drop=True)
print(f" Assigned not dispensed: {len(filtered)}")
return filtered
def build_not_returned(df):
no_ret = df[
df["Date Ret"].isna() &
df["Subject ID"].notna() &
(df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
].copy()
max_asgn = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
no_ret = no_ret.join(max_asgn, on="Subject ID")
filtered = no_ret[no_ret["Date Asgn"] < no_ret["Max Visit Date"]].copy()
filtered = filtered.drop(columns=["Qty Ret", "Date Ret", "Ret User", "Destroyed", "Basket No."])
filtered = filtered.reset_index(drop=True)
print(f" Not returned: {len(filtered)}")
return filtered
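# The "Not returned" rule is easier to follow on plain records: a kit is flagged
# when it belongs to a subject, has no return date, was not marked NOT DISPENSED,
# and was assigned earlier than that subject's most recent assignment (so a later
# visit has taken place since). A sketch with made-up kits; the dictionary keys
# and the name not_returned are assumptions made for this example.

```python
from datetime import date


def not_returned(kits):
    """Return the med_id of every kit that should have been returned by now."""
    latest = {}  # subject -> most recent assignment date across all of their kits
    for kit in kits:
        subject, assigned = kit["subject"], kit["date_assigned"]
        if subject is not None and assigned is not None:
            latest[subject] = max(latest.get(subject, assigned), assigned)

    flagged = []
    for kit in kits:
        if kit["subject"] is None or kit["date_returned"] is not None:
            continue
        if (kit["disp_status"] or "").upper() == "NOT DISPENSED":
            continue
        assigned = kit["date_assigned"]
        if assigned is not None and assigned < latest[kit["subject"]]:
            flagged.append(kit["med_id"])
    return flagged


def kit(med_id, subject, assigned, returned=None, status="Dispensed"):
    return {"med_id": med_id, "subject": subject, "date_assigned": assigned,
            "date_returned": returned, "disp_status": status}


kits = [
    kit("M1", "A", date(2026, 1, 5)),                           # older visit, never returned
    kit("M2", "A", date(2026, 2, 2)),                           # latest visit, still in use
    kit("M3", "A", date(2026, 1, 5), returned=date(2026, 2, 2)),
    kit("M4", "B", date(2026, 1, 10), status="Not Dispensed"),
    kit("M5", "B", date(2026, 2, 10)),
]

flagged = not_returned(kits)
print(flagged)
```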
def build_kits_for_destruction(df):
mask = (
df["Basket No."].isna() &
(df["Date Ret"].notna() | (df["Disp Status"].fillna("").str.upper() == "NOT DISPENSED"))
)
filtered = (
df[mask]
.copy()
.sort_values(["Site", "Date Ret"], ascending=[True, True])
.drop(columns=["Destroyed", "Basket No."])
.reset_index(drop=True)
)
print(f" Kits for destruction: {len(filtered)}")
return filtered
# ── Formatting ────────────────────────────────────────────────────────────────
def format_sheet(ws, header_color, highlight_col=None, highlight_color=None):
thin = Side(style="thin", color="000000")
border = Border(left=thin, right=thin, top=thin, bottom=thin)
header_fill = PatternFill("solid", start_color=header_color)
header_font = Font(bold=True, color="FFFFFF", name="Arial", size=10)
row_font = Font(name="Arial", size=10)
hi_fill = PatternFill("solid", start_color=highlight_color) if highlight_color else None
headers = [cell.value for cell in ws[1]]
for cell in ws[1]:
cell.fill = header_fill
cell.font = header_font
cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=False)
cell.border = border
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
for cell in row:
col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
cell.font = row_font
cell.border = border
cell.alignment = Alignment(horizontal="center")
if col_name in DATE_COLUMNS:
cell.number_format = "DD-MMM-YYYY"
if hi_fill and col_name == highlight_col:
cell.fill = hi_fill
for cell in ws[1]:
width = COLUMN_WIDTHS.get(cell.value, 14)
ws.column_dimensions[get_column_letter(cell.column)].width = width
ws.auto_filter.ref = ws.dimensions
ws.freeze_panes = "A2"
def format_shipment_sheet(ws, header_color_ship, header_color_detail, n_ship_cols):
thin = Side(style="thin", color="000000")
border = Border(left=thin, right=thin, top=thin, bottom=thin)
hfont = Font(bold=True, color="FFFFFF", name="Arial", size=10)
dfont = Font(name="Arial", size=10)
fill_ship = PatternFill("solid", start_color=header_color_ship)
fill_detail = PatternFill("solid", start_color=header_color_detail)
for cell in ws[1]:
cell.fill = fill_ship if cell.column <= n_ship_cols else fill_detail
cell.font = hfont
cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
cell.border = border
ws.column_dimensions[get_column_letter(cell.column)].width = min(
len(str(cell.value or "")) + 4, 35
)
ws.row_dimensions[1].height = 30
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
for cell in row:
cell.font = dfont
cell.border = border
cell.alignment = Alignment(horizontal="center", vertical="center")
if isinstance(cell.value, date):  # datetime and pandas Timestamp are subclasses of date
cell.number_format = "DD-MMM-YYYY"
ws.auto_filter.ref = ws.dimensions
ws.freeze_panes = "A2"
# ── Main ──────────────────────────────────────────────────────────────────────
def main():
OUTPUT_DIR.mkdir(exist_ok=True)
print(f"\nLoading data from MySQL for {STUDY}...")
conn = get_conn()
cursor = conn.cursor(dictionary=True)
import_id = get_latest_import_id(cursor, STUDY)
print(f" import_id = {import_id}")
df = load_inventory(cursor, STUDY, import_id)
shipments_df = load_shipments(cursor, STUDY, import_id)
cursor.close()
conn.close()
expired_df, expired_sheet = build_expired(df)
assigned_df = build_assigned_not_dispensed(df)
not_returned_df = build_not_returned(df)
destruction_df = build_kits_for_destruction(df)
site_summary_df = build_site_summary(shipments_df)
with pd.ExcelWriter(OUTPUT_FILE, engine="openpyxl") as writer:
df.to_excel( writer, index=False, sheet_name="CountryMedicationOverview")
expired_df.to_excel( writer, index=False, sheet_name=expired_sheet)
assigned_df.to_excel( writer, index=False, sheet_name="Assigned not dispensed")
not_returned_df.to_excel( writer, index=False, sheet_name="Not returned")
destruction_df.to_excel( writer, index=False, sheet_name="Kits for destruction")
shipments_df.to_excel( writer, index=False, sheet_name="Shipments")
site_summary_df.to_excel( writer, index=False, sheet_name="Site Summary")
wb = load_workbook(OUTPUT_FILE)
ws_main = wb["CountryMedicationOverview"]
format_sheet(ws_main, header_color="1F4E79")
new_col_fill = PatternFill("solid", start_color="E2EFDA")
headers_main = [c.value for c in ws_main[1]]
for row in ws_main.iter_rows(min_row=2, max_row=ws_main.max_row):
for cell in row:
col_name = headers_main[cell.column - 1] if cell.column <= len(headers_main) else None
if col_name in ("Destroyed", "Basket No."):
cell.fill = new_col_fill
format_sheet(wb[expired_sheet], header_color="C00000", highlight_col="Exp Date", highlight_color="FFE0E0")
format_sheet(wb["Assigned not dispensed"], header_color="833C00", highlight_col="Subject ID", highlight_color="FFF2CC")
format_sheet(wb["Not returned"], header_color="375623", highlight_col="Max Visit Date", highlight_color="E2EFDA")
format_sheet(wb["Kits for destruction"], header_color="595959")
format_shipment_sheet(wb["Shipments"], "1F4E79", "375623", N_SHIP_COLS)
format_sheet(wb["Site Summary"], header_color="1F4E79")
wb.save(OUTPUT_FILE)
print(f"\nUloženo: {OUTPUT_FILE} ({len(df)} řádků, sheety: {wb.sheetnames})")
if __name__ == "__main__":
main()
@@ -0,0 +1,205 @@
import sys
import os
import mysql.connector
import openpyxl
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
from datetime import date
import pandas as pd
# db_config.py lives in the parent directory (Drugs/)
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
import db_config
STUDY = "77242113UCO3001"
OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output")
os.makedirs(OUTPUT_DIR, exist_ok=True)
def get_conn():
    return mysql.connector.connect(
        host=db_config.DB_HOST, port=db_config.DB_PORT,
        user=db_config.DB_USER, password=db_config.DB_PASSWORD,
        database=db_config.DB_NAME,
    )


def load_data(study):
    conn = get_conn()
    cursor = conn.cursor(dictionary=True)
    # latest import_id for the given study
    cursor.execute(
        "SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='drugs'",
        (study,),
    )
    row = cursor.fetchone()
    import_id = row["mid"]
    if import_id is None:
        raise RuntimeError(f"No data in MySQL for study {study}")
    print(f" import_id = {import_id}")
    sql = """
        SELECT
            s.shipment_id,
            s.status AS irt_shipment_status,
            s.type,
            s.ship_from AS shipment_from,
            s.ship_to_site AS ship_to,
            s.request_date,
            s.received_date,
            s.received_by,
            s.expected_arrival,
            i.investigator,
            i.medication_description,
            i.medication_id,
            i.packaged_lot_no,
            i.expiration_date,
            i.item_status AS status
        FROM iwrs_shipments s
        JOIN iwrs_shipment_items i
            ON i.study = s.study
            AND i.shipment_id = s.shipment_id
            AND i.import_id = %s
        WHERE s.import_id = %s
            AND s.study = %s
        ORDER BY s.ship_to_site, s.shipment_id, i.medication_id
    """
    cursor.execute(sql, (import_id, import_id, study))
    rows = cursor.fetchall()
    cursor.close()
    conn.close()
    print(f" Rows loaded: {len(rows)}")
    return rows
# shipment columns (blue header) / detail columns (green header)
SHIP_COLS = [
    ("shipment_id", "Shipment ID"),
    ("irt_shipment_status", "IRT Shipment Status"),
    ("type", "Type"),
    ("shipment_from", "Shipment From"),
    ("ship_to", "Ship To:"),
    ("request_date", "Request Date"),
    ("received_date", "Received Date"),
    ("received_by", "Received by"),
    ("expected_arrival", "Expected Arrival"),
]
DETAIL_COLS = [
    ("investigator", "Investigator"),
    ("medication_description", "Medication Description"),
    ("medication_id", "Medication ID"),
    ("packaged_lot_no", "Packaged Lot number"),
    ("expiration_date", "Expiration Date"),
    ("status", "Status"),
]
ALL_COLS = SHIP_COLS + DETAIL_COLS
N_SHIP_COLS = len(SHIP_COLS)

HEADER_FILL_SHIP = PatternFill("solid", fgColor="1F4E79")
HEADER_FILL_DETAIL = PatternFill("solid", fgColor="375623")
HEADER_FONT = Font(name="Arial", bold=True, color="FFFFFF", size=10)
DATA_FONT = Font(name="Arial", size=10)
THIN_BORDER = Border(
    left=Side(style="thin", color="BFBFBF"),
    right=Side(style="thin", color="BFBFBF"),
    bottom=Side(style="thin", color="BFBFBF"),
)
def write_shipments_sheet(wb, rows):
    ws = wb.active
    ws.title = "Shipments"
    # header row
    for ci, (_, label) in enumerate(ALL_COLS, 1):
        cell = ws.cell(row=1, column=ci, value=label)
        cell.font = HEADER_FONT
        cell.fill = HEADER_FILL_SHIP if ci <= N_SHIP_COLS else HEADER_FILL_DETAIL
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        cell.border = THIN_BORDER
    ws.row_dimensions[1].height = 30
    # data rows
    for ri, row in enumerate(rows, 2):
        for ci, (key, _) in enumerate(ALL_COLS, 1):
            val = row[key]
            cell = ws.cell(row=ri, column=ci, value=val)
            cell.font = DATA_FONT
            cell.border = THIN_BORDER
            cell.alignment = Alignment(horizontal="center", vertical="center")
            if isinstance(val, date):
                cell.number_format = "DD-MMM-YYYY"
    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"
    # column widths: longest value plus padding, capped at 35
    for ci, (key, label) in enumerate(ALL_COLS, 1):
        vals = [label] + [str(r[key]) for r in rows if r[key] is not None]
        ws.column_dimensions[get_column_letter(ci)].width = min(
            max((len(v) for v in vals), default=10) + 2, 35
        )
def write_summary_sheet(wb, rows):
    STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
    df = pd.DataFrame(rows)
    # count kits per site and status
    pivot = df.groupby("ship_to")["status"].value_counts().unstack(fill_value=0)
    for s in STATUS_COLS:
        if s not in pivot.columns:
            pivot[s] = 0
    pivot = (
        pivot[STATUS_COLS]
        .reset_index()
        .rename(columns={"ship_to": "Site", "Returned by Subject": "Returned"})
        .sort_values("Site")
        .reset_index(drop=True)
    )
    pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)

    ws = wb.create_sheet("Site Summary")
    s_cols = ["Site", "Available", "Assigned", "Dispensed", "Returned", "Total"]
    for ci, col in enumerate(s_cols, 1):
        cell = ws.cell(row=1, column=ci, value=col)
        cell.font = HEADER_FONT
        cell.fill = PatternFill("solid", fgColor="1F4E79")
        cell.alignment = Alignment(horizontal="center", vertical="center")
        cell.border = THIN_BORDER
    ws.row_dimensions[1].height = 25
    for ri, (_, row) in enumerate(pivot.iterrows(), 2):
        for ci, col in enumerate(s_cols, 1):
            cell = ws.cell(row=ri, column=ci, value=row[col])
            cell.font = DATA_FONT
            cell.border = THIN_BORDER
            cell.alignment = Alignment(horizontal="center", vertical="center")
    for ci, col in enumerate(s_cols, 1):
        vals = [col] + [str(pivot.iloc[r][col]) for r in range(len(pivot))]
        ws.column_dimensions[get_column_letter(ci)].width = min(
            max(len(v) for v in vals) + 4, 35
        )
    ws.freeze_panes = "A2"
def build_report():
    print(f"\nLoading data from MySQL for {STUDY}...")
    rows = load_data(STUDY)
    wb = openpyxl.Workbook()
    write_shipments_sheet(wb, rows)
    write_summary_sheet(wb, rows)
    outfile = os.path.join(OUTPUT_DIR, f"{date.today()} {STUDY} CZ Shipments.xlsx")
    wb.save(outfile)
    print(f"\nSaved -> {outfile}")


# Build the report only when run as a script, not when the module is imported.
if __name__ == "__main__":
    build_report()
@@ -0,0 +1,393 @@
import sys
import os
import mysql.connector
import pandas as pd
from datetime import date
from pathlib import Path
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
import db_config
STUDIES = [
    ("77242113UCO3001", "UCO"),
    ("42847922MDD3003", "MDD"),
]
BASE_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
OUTPUT_DIR = BASE_DIR / "output"
DATE_COLUMNS = {
    "Orig Exp Date", "Exp Date", "Rcv Date",
    "Date Asgn", "Disp Date", "Date Ret", "Destroyed", "Max Visit Date",
}
COLUMN_WIDTHS = {
    "Site": 14,
    "Med ID": 10,
    "Lot No.": 12,
    "Orig Exp Date": 16,
    "Exp Date": 14,
    "Rcv Date": 14,
    "Rcpt User": 22,
    "Subject ID": 14,
    "Qty Asgn": 9,
    "IRT Tx": 8,
    "Date Asgn": 14,
    "Asgn User": 20,
    "Disp Status": 16,
    "Disp Date": 14,
    "Qty Disp": 9,
    "Disp User": 20,
    "Qty Ret": 10,
    "Date Ret": 14,
    "Ret User": 18,
    "Destroyed": 14,
    "Basket No.": 12,
    "Max Visit Date": 16,
}
N_SHIP_COLS = 9  # number of shipment columns (blue header on the Shipments sheet)
# ── DB ────────────────────────────────────────────────────────────────────────
def get_conn():
    return mysql.connector.connect(
        host=db_config.DB_HOST, port=db_config.DB_PORT,
        user=db_config.DB_USER, password=db_config.DB_PASSWORD,
        database=db_config.DB_NAME,
    )


def get_latest_import_id(cursor, study):
    cursor.execute(
        "SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='drugs'",
        (study,),
    )
    row = cursor.fetchone()
    mid = row["mid"]
    if mid is None:
        raise RuntimeError(f"No data in MySQL for study {study}")
    return mid
# ── Data loading ──────────────────────────────────────────────────────────────
def load_inventory(cursor, study, import_id):
    sql = """
        SELECT
            i.site AS Site,
            i.medication_id AS `Med ID`,
            i.packaged_lot_no AS `Lot No.`,
            i.original_expiration_date AS `Orig Exp Date`,
            i.expiration_date AS `Exp Date`,
            i.received_date AS `Rcv Date`,
            i.receipt_user AS `Rcpt User`,
            i.subject_identifier AS `Subject ID`,
            i.quantity_assigned AS `Qty Asgn`,
            i.irt_transaction AS `IRT Tx`,
            i.date_assigned AS `Date Asgn`,
            i.assignment_user AS `Asgn User`,
            i.dispensation_status AS `Disp Status`,
            i.dispensing_date AS `Disp Date`,
            i.quantity_dispensed AS `Qty Disp`,
            i.dispensing_user AS `Disp User`,
            i.quantity_returned AS `Qty Ret`,
            i.date_returned AS `Date Ret`,
            i.return_user AS `Ret User`,
            d.destruction_date AS Destroyed,
            d.basket_id AS `Basket No.`
        FROM iwrs_inventory i
        LEFT JOIN (
            SELECT medication_id,
                   ANY_VALUE(basket_id) AS basket_id,
                   ANY_VALUE(destruction_date) AS destruction_date
            FROM iwrs_destruction
            WHERE study = %s
            GROUP BY medication_id
        ) d ON d.medication_id = i.medication_id
        WHERE i.import_id = %s
            AND i.study = %s
        ORDER BY i.site, i.received_date, i.medication_id
    """
    # parameter order follows the placeholders: subquery study, import_id, outer study
    cursor.execute(sql, (study, import_id, study))
    rows = cursor.fetchall()
    df = pd.DataFrame(rows)
    for col in DATE_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    print(f" Inventory: {len(df)} kits")
    return df
def load_shipments(cursor, study, import_id):
    sql = """
        SELECT
            s.shipment_id AS `Shipment ID`,
            s.status AS `IRT Shipment Status`,
            s.type AS Type,
            s.ship_from AS `Shipment From`,
            s.ship_to_site AS `Ship To:`,
            s.request_date AS `Request Date`,
            s.received_date AS `Received Date`,
            s.received_by AS `Received by`,
            s.expected_arrival AS `Expected Arrival`,
            i.investigator AS Investigator,
            i.medication_description AS `Medication Description`,
            i.medication_id AS `Medication ID`,
            i.packaged_lot_no AS `Packaged Lot number`,
            i.expiration_date AS `Expiration Date`,
            i.item_status AS Status
        FROM iwrs_shipments s
        JOIN iwrs_shipment_items i
            ON i.study = s.study
            AND i.shipment_id = s.shipment_id
            AND i.import_id = %s
        WHERE s.import_id = %s
            AND s.study = %s
        ORDER BY s.ship_to_site, s.shipment_id, i.medication_id
    """
    cursor.execute(sql, (import_id, import_id, study))
    rows = cursor.fetchall()
    df = pd.DataFrame(rows)
    for col in ("Request Date", "Received Date", "Expiration Date", "Expected Arrival"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    n_ship = df["Shipment ID"].nunique() if len(df) else 0
    print(f" Shipments: {n_ship} shipments, {len(df)} kits")
    return df
# ── Derived sheets ────────────────────────────────────────────────────────────
def build_site_summary(shipments_df):
    STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
    pivot = shipments_df.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
    for s in STATUS_COLS:
        if s not in pivot.columns:
            pivot[s] = 0
    pivot = (
        pivot[STATUS_COLS]
        .reset_index()
        .rename(columns={"Ship To:": "Site", "Returned by Subject": "Returned"})
        .sort_values("Site")
        .reset_index(drop=True)
    )
    pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)
    print(f" Site Summary: {len(pivot)} sites")
    return pivot
def build_expired(df):
    # kits past expiry that are neither assigned to a subject nor in a destruction basket
    today = date.today()
    mask = (
        df["Basket No."].isna() &
        df["Subject ID"].isna() &
        (df["Exp Date"] < pd.Timestamp(today))
    )
    filtered = df[mask].copy().reset_index(drop=True)
    print(f" Expired: {len(filtered)}")
    return filtered


def build_assigned_not_dispensed(df):
    mask = df["Subject ID"].notna() & df["Disp Date"].isna()
    filtered = df[mask].copy().reset_index(drop=True)
    print(f" Assigned not dispensed: {len(filtered)}")
    return filtered


def build_not_returned(df):
    no_ret = df[
        df["Date Ret"].isna() &
        df["Subject ID"].notna() &
        (df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
    ].copy()
    # latest assignment date per subject; kits assigned before it should already be back
    max_asgn = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
    no_ret = no_ret.join(max_asgn, on="Subject ID")
    filtered = no_ret[no_ret["Date Asgn"] < no_ret["Max Visit Date"]].copy()
    filtered = filtered.drop(columns=["Qty Ret", "Date Ret", "Ret User", "Destroyed", "Basket No."])
    filtered = filtered.reset_index(drop=True)
    print(f" Not returned: {len(filtered)}")
    return filtered


def build_kits_for_destruction(df):
    mask = (
        df["Basket No."].isna() &
        (df["Date Ret"].notna() | (df["Disp Status"].fillna("").str.upper() == "NOT DISPENSED"))
    )
    filtered = (
        df[mask]
        .copy()
        .sort_values(["Site", "Date Ret"], ascending=[True, True])
        .drop(columns=["Destroyed", "Basket No."])
        .reset_index(drop=True)
    )
    print(f" Kits for destruction: {len(filtered)}")
    return filtered
# ── Formatting ────────────────────────────────────────────────────────────────
def format_sheet(ws, header_color, highlight_col=None, highlight_color=None):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    header_fill = PatternFill("solid", start_color=header_color)
    header_font = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    row_font = Font(name="Arial", size=10)
    hi_fill = PatternFill("solid", start_color=highlight_color) if highlight_color else None
    headers = [cell.value for cell in ws[1]]
    for cell in ws[1]:
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=False)
        cell.border = border
    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        for cell in row:
            col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
            cell.font = row_font
            cell.border = border
            cell.alignment = Alignment(horizontal="center")
            if col_name in DATE_COLUMNS:
                cell.number_format = "DD-MMM-YYYY"
            if hi_fill and col_name == highlight_col:
                cell.fill = hi_fill
    for cell in ws[1]:
        width = COLUMN_WIDTHS.get(cell.value, 14)
        ws.column_dimensions[get_column_letter(cell.column)].width = width
    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"
def format_overview_sheet(ws):
    format_sheet(ws, header_color="1F4E79")
    new_col_fill = PatternFill("solid", start_color="E2EFDA")
    headers = [c.value for c in ws[1]]
    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        for cell in row:
            col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
            if col_name in ("Destroyed", "Basket No."):
                cell.fill = new_col_fill


def format_shipment_sheet(ws):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    hfont = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    dfont = Font(name="Arial", size=10)
    fill_ship = PatternFill("solid", start_color="1F4E79")
    fill_detail = PatternFill("solid", start_color="375623")
    for cell in ws[1]:
        cell.fill = fill_ship if cell.column <= N_SHIP_COLS else fill_detail
        cell.font = hfont
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        cell.border = border
        ws.column_dimensions[get_column_letter(cell.column)].width = min(
            len(str(cell.value or "")) + 4, 35
        )
    ws.row_dimensions[1].height = 30
    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        for cell in row:
            cell.font = dfont
            cell.border = border
            cell.alignment = Alignment(horizontal="center", vertical="center")
            # datetime.datetime and pandas.Timestamp are both subclasses of datetime.date
            if isinstance(cell.value, date):
                cell.number_format = "DD-MMM-YYYY"
    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"
# ── Main ──────────────────────────────────────────────────────────────────────
SHEETS_DEF = [
    ("CountryMedicationOverview", "overview"),
    ("Expired", "expired"),
    ("Assigned not dispensed", "assigned"),
    ("Not returned", "not_returned"),
    ("Kits for destruction", "destruction"),
    ("Shipments", "shipments"),
    ("Site Summary", "site_summary"),
]
FORMAT_MAP = {
    "overview": lambda ws: format_overview_sheet(ws),
    "expired": lambda ws: format_sheet(ws, "C00000", "Exp Date", "FFE0E0"),
    "assigned": lambda ws: format_sheet(ws, "833C00", "Subject ID", "FFF2CC"),
    "not_returned": lambda ws: format_sheet(ws, "375623", "Max Visit Date", "E2EFDA"),
    "destruction": lambda ws: format_sheet(ws, "595959"),
    "shipments": lambda ws: format_shipment_sheet(ws),
    "site_summary": lambda ws: format_sheet(ws, "1F4E79"),
}
def process_study(cursor, study):
    import_id = get_latest_import_id(cursor, study)
    print(f" import_id = {import_id}")
    df = load_inventory(cursor, study, import_id)
    shipments_df = load_shipments(cursor, study, import_id)
    expired_df = build_expired(df)
    assigned_df = build_assigned_not_dispensed(df)
    not_returned_df = build_not_returned(df)
    destruction_df = build_kits_for_destruction(df)
    site_summ_df = build_site_summary(shipments_df)
    # order must match SHEETS_DEF
    return [
        df, expired_df, assigned_df, not_returned_df,
        destruction_df, shipments_df, site_summ_df,
    ]


def save_study_report(study, data_frames):
    output_file = OUTPUT_DIR / f"{date.today().strftime('%Y-%m-%d')} {study} report.xlsx"
    with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
        for (sheet_name, _), df_sheet in zip(SHEETS_DEF, data_frames):
            df_sheet.to_excel(writer, index=False, sheet_name=sheet_name)
    # re-open the written workbook to apply formatting
    wb = load_workbook(output_file)
    for (sheet_name, fmt_key) in SHEETS_DEF:
        FORMAT_MAP[fmt_key](wb[sheet_name])
    wb.save(output_file)
    print(f" Saved: {output_file}")


def main():
    OUTPUT_DIR.mkdir(exist_ok=True)
    conn = get_conn()
    cursor = conn.cursor(dictionary=True)
    for study, _ in STUDIES:
        print(f"\n{'='*55}")
        print(f"[{study}]")
        print(f"{'='*55}")
        try:
            data_frames = process_study(cursor, study)
            save_study_report(study, data_frames)
        except Exception as e:
            # report the failure and continue with the next study
            import traceback
            print(f" ERROR: {e}")
            traceback.print_exc()
    cursor.close()
    conn.close()
    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,76 @@
from playwright.sync_api import sync_playwright
import os
# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"
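# Credentials are read from the environment rather than hard-coded in the repository.
# The variable names IWRS_EMAIL and IWRS_PASSWORD are assumptions; adjust to your setup.
EMAIL = os.environ["IWRS_EMAIL"]
PASSWORD = os.environ["IWRS_PASSWORD"]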
# STUDY = "42847922MDD3003"
STUDY = "77242113UCO3001"
OUTPUT_DIR = f"xls_ip_destruction_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────
def run(page, study):
    output_dir = f"xls_ip_destruction_{study}"
    os.makedirs(output_dir, exist_ok=True)
    page.goto(f"{BASE_URL}/report/ip_destruction_form")
    page.wait_for_load_state("networkidle", timeout=120000)

    # open the dropdown once to collect all basket names
    page.locator('input[placeholder="search"], input[type="text"]').first.click()
    page.wait_for_timeout(1000)
    baskets = [b.strip() for b in page.locator('mat-option').all_inner_texts()
               if b.strip() and b.strip() != "No results found"]
    print(f" Found {len(baskets)} baskets: {baskets}")
    page.keyboard.press("Escape")
    page.wait_for_timeout(500)

    if not baskets:
        print(" No destruction baskets, skipping.")
        return

    for basket in baskets:
        filename = os.path.join(output_dir, f"ip_destruction_basket_{basket}.xlsx")
        if os.path.exists(filename):
            print(f" [{basket}] Skipping, already exists.")
            continue
        print(f" [{basket}] Downloading...")
        input_field = page.locator('input[placeholder="search"], input[type="text"]').first
        input_field.click()
        input_field.fill(basket)
        page.wait_for_timeout(500)
        page.locator('mat-option').first.dispatch_event('click')
        page.wait_for_load_state("networkidle", timeout=120000)
        with page.expect_download(timeout=120000) as dl:
            page.get_by_role("button", name="Download XLS").click()
        dl.value.save_as(filename)
        print(f" [{basket}] OK")
        page.get_by_role("button", name="Clear").click()
        page.wait_for_load_state("networkidle", timeout=120000)
    print(" Destruction done.")


if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,83 @@
from playwright.sync_api import sync_playwright
import os
# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"
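# Credentials are read from the environment rather than hard-coded in the repository.
# The variable names IWRS_EMAIL and IWRS_PASSWORD are assumptions; adjust to your setup.
EMAIL = os.environ["IWRS_EMAIL"]
PASSWORD = os.environ["IWRS_PASSWORD"]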
# STUDY = "42847922MDD3003"
STUDY = "77242113UCO3001"
SITES = {
    "42847922MDD3003": [
        "S10-CZ10002",
        "S10-CZ10004",
        "S10-CZ10005",
        "S10-CZ10008",
        "S10-CZ10011",
        "S10-CZ10012",
    ],
    "77242113UCO3001": [
        "DD5-CZ10001",
        "DD5-CZ10003",
        "DD5-CZ10006",
        "DD5-CZ10009",
        "DD5-CZ10010",
        "DD5-CZ10012",
        "DD5-CZ10013",
        "DD5-CZ10015",
        "DD5-CZ10016",
        "DD5-CZ10020",
        "DD5-CZ10021",
        "DD5-CZ10022",
    ],
}
OUTPUT_DIR = f"xls_reports_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────
def run(page, study):
    output_dir = f"xls_reports_{study}"
    os.makedirs(output_dir, exist_ok=True)
    page.goto(f"{BASE_URL}/report/onsite_inventory_detail")
    page.wait_for_load_state("networkidle", timeout=120000)
    for site_id in SITES[study]:
        print(f" [{site_id}] Downloading...")
        page.locator('input[placeholder="search"], input[type="text"]').first.click()
        page.get_by_role("option", name=site_id).click()
        page.wait_for_load_state("networkidle", timeout=120000)
        with page.expect_download(timeout=120000) as dl:
            page.get_by_role("button", name="Download XLS").click()
        dl.value.save_as(os.path.join(output_dir, f"onsite_inventory_detail_{site_id}.xlsx"))
        print(f" [{site_id}] OK")
        page.get_by_role("button", name="Clear").click()
        page.wait_for_load_state("networkidle", timeout=120000)
    print(" Inventory done.")


if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,95 @@
from playwright.sync_api import sync_playwright
import os
import pandas as pd
# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"
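# Credentials are read from the environment rather than hard-coded in the repository.
# The variable names IWRS_EMAIL and IWRS_PASSWORD are assumptions; adjust to your setup.
EMAIL = os.environ["IWRS_EMAIL"]
PASSWORD = os.environ["IWRS_PASSWORD"]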
STUDY = "42847922MDD3003"
#STUDY = "77242113UCO3001"
OUTPUT_DIR = f"xls_shipment_details_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────
def get_cz_shipment_ids(study):
    """Return CZ shipment IDs from the downloaded shipments report, or None if it is missing."""
    path = f"xls_shipments_{study}/shipments_report_{study}.xlsx"
    if not os.path.exists(path):
        return None
    df = pd.read_excel(path, header=5)
    df.columns = df.columns.str.strip()
    df = df.dropna(how="all")
    df["Shipment ID"] = df["Shipment ID"].astype(str).str.strip()
    cz = df[df["Location"].str.contains("Czech", na=False, case=False)]
    return cz["Shipment ID"].tolist()


def run(page, study):
    output_dir = f"xls_shipment_details_{study}"
    os.makedirs(output_dir, exist_ok=True)
    page.goto(f"{BASE_URL}/report/shipment_details_report")
    page.wait_for_load_state("networkidle", timeout=120000)

    cz_ids = get_cz_shipment_ids(study)
    if cz_ids is not None:
        shipments = cz_ids
        print(f" Filtered from the shipments report: {len(shipments)} CZ shipments")
    else:
        # fall back to every shipment listed in the dropdown
        page.locator('input[placeholder="search"], input[type="text"]').first.click()
        page.wait_for_timeout(1000)
        shipments = [s.strip() for s in page.locator('mat-option').all_inner_texts()
                     if s.strip() and s.strip() != "No results found"]
        print(f" Found {len(shipments)} shipments in the dropdown")
        page.keyboard.press("Escape")
        page.wait_for_timeout(500)

    if not shipments:
        print(" No shipments, skipping.")
        return

    for shipment in shipments:
        filename = os.path.join(output_dir, f"shipment_details_{shipment}.xlsx")
        if os.path.exists(filename):
            print(f" [{shipment}] Skipping, already exists.")
            continue
        print(f" [{shipment}] Downloading...")
        input_field = page.locator('input[placeholder="search"], input[type="text"]').first
        input_field.click()
        input_field.fill(shipment)
        page.wait_for_timeout(500)
        page.locator('mat-option').first.dispatch_event('click')
        page.wait_for_load_state("networkidle", timeout=120000)
        with page.expect_download(timeout=120000) as dl:
            page.get_by_role("button", name="Download XLS").click()
        dl.value.save_as(filename)
        print(f" [{shipment}] OK")
        page.get_by_role("button", name="Clear").click()
        page.wait_for_load_state("networkidle", timeout=120000)
    print(" Shipment details done.")


if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,47 @@
from playwright.sync_api import sync_playwright
import os
# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"
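# Credentials are read from the environment rather than hard-coded in the repository.
# The variable names IWRS_EMAIL and IWRS_PASSWORD are assumptions; adjust to your setup.
EMAIL = os.environ["IWRS_EMAIL"]
PASSWORD = os.environ["IWRS_PASSWORD"]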
# STUDY = "42847922MDD3003"
STUDY = "77242113UCO3001"
OUTPUT_DIR = f"xls_shipments_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────
def run(page, study):
    output_dir = f"xls_shipments_{study}"
    os.makedirs(output_dir, exist_ok=True)
    page.goto(f"{BASE_URL}/report/shipments_report")
    page.wait_for_load_state("networkidle", timeout=120000)
    filename = os.path.join(output_dir, f"shipments_report_{study}.xlsx")
    with page.expect_download(timeout=120000) as dl:
        page.get_by_role("button", name="Download XLS").click()
    dl.value.save_as(filename)
    print(f" Shipments report OK -> {filename}")


if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,441 @@
"""
Imports drugs data from IWRS Excel reports into MySQL.

Tables:
    iwrs_shipments       - shipments (CZ only, versioned by import_id)
    iwrs_shipment_items  - shipment contents (versioned by import_id)
    iwrs_inventory       - drug inventory at the sites (versioned by import_id)
    iwrs_destruction     - destruction records (not versioned; already imported baskets are skipped)

Run after the files have been downloaded (or via run_all.py).
"""
import os
import glob
import re
import datetime
import numpy as np
import pandas as pd
import mysql.connector
import db_config
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
STUDIES = ["77242113UCO3001", "42847922MDD3003"]
SITES = {
    "77242113UCO3001": [
        "DD5-CZ10001", "DD5-CZ10003", "DD5-CZ10006", "DD5-CZ10009",
        "DD5-CZ10010", "DD5-CZ10012", "DD5-CZ10013", "DD5-CZ10015",
        "DD5-CZ10016", "DD5-CZ10020", "DD5-CZ10021", "DD5-CZ10022",
    ],
    "42847922MDD3003": [
        "S10-CZ10002", "S10-CZ10004", "S10-CZ10005",
        "S10-CZ10008", "S10-CZ10011", "S10-CZ10012",
    ],
}
# ── type converters ──────────────────────────────────────────────────────────
def _py(val):
    """Unwrap NumPy scalars into plain Python values."""
    if isinstance(val, np.generic):
        return val.item()
    return val


def to_date(val):
    """Convert a cell value to datetime.date, or None if it is empty or unparseable."""
    val = _py(val)
    if val is None:
        return None
    if isinstance(val, float) and (val != val):  # NaN
        return None
    try:
        if pd.isna(val):
            return None
    except (TypeError, ValueError):
        pass
    if isinstance(val, pd.Timestamp):
        return None if pd.isna(val) else val.date()
    if isinstance(val, datetime.datetime):
        return val.date()
    if isinstance(val, datetime.date):
        return val
    s = str(val).strip()
    if not s or s.lower() in ("nat", "nan", "none", ""):
        return None
    for fmt in ("%Y-%m-%d", "%d-%b-%Y", "%d-%m-%Y", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None


def to_int(val):
    """Convert a cell value to int, or None if it is empty or not numeric."""
    val = _py(val)
    try:
        v = float(val)
        return None if (v != v) else int(v)
    except (TypeError, ValueError, OverflowError):
        # OverflowError covers infinite values, which int() cannot represent
        return None


def to_str(val):
    """Convert a cell value to a stripped string, or None for empty/NaN-like values."""
    val = _py(val)
    if val is None:
        return None
    if isinstance(val, float) and (val != val):  # NaN
        return None
    s = str(val).strip()
    return None if s.lower() in ("nan", "nat", "none", "") else s
# ── DB helpers ───────────────────────────────────────────────────────────────
def get_conn():
    return mysql.connector.connect(
        host=db_config.DB_HOST, port=db_config.DB_PORT,
        user=db_config.DB_USER, password=db_config.DB_PASSWORD,
        database=db_config.DB_NAME,
    )


def insert_import(cursor, study, source_label):
    cursor.execute(
        "INSERT INTO iwrs_import (study, imported_at, source_file, report_type) VALUES (%s, %s, %s, %s)",
        (study, datetime.datetime.now(), source_label, "drugs"),
    )
    return cursor.lastrowid


def basket_already_imported(cursor, study, basket_id):
    cursor.execute(
        "SELECT 1 FROM iwrs_destruction WHERE study=%s AND basket_id=%s LIMIT 1",
        (study, str(basket_id)),
    )
    return cursor.fetchone() is not None
# ── parsers ──────────────────────────────────────────────────────────────────
def parse_shipments_report(study):
    path = os.path.join(BASE_DIR, f"xls_shipments_{study}", f"shipments_report_{study}.xlsx")
    if not os.path.exists(path):
        print(f" MISSING: {path}")
        return []
    # locate the header row: the first row containing a "Shipment ID" cell
    raw = pd.read_excel(path, header=None)
    header_row = None
    for i, row in raw.iterrows():
        if "Shipment ID" in [str(v).strip() for v in row]:
            header_row = i
            break
    if header_row is None:
        return []
    df = pd.read_excel(path, header=header_row)
    # header detection above ignores surrounding whitespace, so strip the column names too
    df.columns = df.columns.astype(str).str.strip()
    df = df.dropna(how="all")
    # CZ shipments only
    df = df[df["Location"].astype(str).str.contains("Czech", na=False, case=False)]
    col = df.columns.tolist()
    rows = []
    for _, r in df.iterrows():
        rows.append({
            "shipment_id": to_str(r["Shipment ID"]),
            "status": to_str(r["IRT Shipment Status"]),
            "type": to_str(r["Type"]),
            "ship_from": to_str(r["Shipment From"]),
            "ship_to_site": to_str(r["Ship To:"]),
            "location": to_str(r["Location"]),
            "request_date": to_date(r["Request Date"]),
            "shipped_date": to_date(r["Shipped Date"]),
            "received_date": to_date(r["Received Date"]) if "Received Date" in col else None,
            "received_by": to_str(r["Received by"]) if "Received by" in col else None,
            "delivered_date_utc": to_date(r["Delivered Date [UTC]"]) if "Delivered Date [UTC]" in col else None,
            "delivery_recipient": to_str(r["Delivery Recipient"]) if "Delivery Recipient" in col else None,
            "delivery_details": to_str(r["Delivery Details"]) if "Delivery Details" in col else None,
            "cancelled_date": to_date(r["Cancelled Date"]) if "Cancelled Date" in col else None,
            "total_medication_ids": to_int(r["Total Medication IDs"]) if "Total Medication IDs" in col else None,
            "tracking_no": to_str(r["Tracking #"]) if "Tracking #" in col else None,
            "shipping_category": to_str(r["Shipping Category"]) if "Shipping Category" in col else None,
            "expected_arrival": to_date(r["Expected Arrival"]) if "Expected Arrival" in col else None,
        })
    return rows
def parse_shipment_details(study):
    detail_dir = os.path.join(BASE_DIR, f"xls_shipment_details_{study}")
    files = sorted(glob.glob(os.path.join(detail_dir, "shipment_details_*.xlsx")))
    rows = []
    for path in files:
        # Shipment ID is taken from the file name
        m = re.search(r"shipment_details_(.+)\.xlsx", os.path.basename(path))
        shipment_id = m.group(1) if m else "UNKNOWN"
        raw = pd.read_excel(path, header=None)
        # Find the row that holds the column headers
        header_row = None
        for i, row in raw.iterrows():
            if "Medication ID" in [str(v).strip() for v in row]:
                header_row = i
                break
        if header_row is None:
            continue
        df = pd.read_excel(path, header=header_row)
        df = df.dropna(how="all")
        col = df.columns.tolist()
        for _, r in df.iterrows():
            # Normalize column names that differ between studies
            med_desc = (to_str(r.get("Medication Description"))
                        or to_str(r.get("Medication ID Description")))
            med_type = (to_str(r.get("Medication type"))
                        or to_str(r.get("Medication ID type")))
            rows.append({
                "shipment_id": shipment_id,
                "destination_location": to_str(r.get("Destination Location")),
                "shipment_status": to_str(r.get("IRT Shipment Status")),
                "shipment_type": to_str(r.get("Type")),
                "destination_site": to_str(r.get("Destination Site")),
                "investigator": to_str(r.get("Investigator")),
                "medication_description": med_desc,
                "medication_type": med_type,
                "medication_id": to_str(r.get("Medication ID")),
                "packaged_lot_no": to_str(r.get("Packaged Lot number")),
                "packaged_lot_description": to_str(r.get("Packaged Lot description")),
                "container_id": to_str(r.get("Container ID")),
                "quantity": to_int(r.get("Quantity of Medication IDs")),
                "expiration_date": to_date(r.get("Expiration Date")),
                "item_status": to_str(r.get("Status")),
            })
    return rows
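# The column-name normalization above could be factored into one helper.
# This is a minimal sketch, not part of the original script: `first_present`
# is a hypothetical name, and treating None, NaN and blank strings as
# "missing" is an assumption about how the exported sheets behave.

```python
import math


def first_present(row, *names):
    """Return the first non-missing value among the aliased columns, else None."""
    for name in names:
        value = row.get(name)
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        if isinstance(value, str) and not value.strip():
            continue
        return value
    return None


# Made-up sample row; a pandas Series exposes the same .get() method as a dict.
example = {"Medication ID Description": "Placebo", "Medication type": float("nan")}
print(first_present(example, "Medication Description", "Medication ID Description"))
print(first_present(example, "Medication type", "Medication ID type"))
```

# A call such as to_str(first_present(r, "Medication Description",
# "Medication ID Description")) would then replace the paired `or` expressions.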
def parse_inventory(study):
    inv_dir = os.path.join(BASE_DIR, f"xls_reports_{study}")
    files = sorted(glob.glob(os.path.join(inv_dir, "onsite_inventory_detail_*.xlsx")))
    rows = []
    for path in files:
        raw = pd.read_excel(path, header=None)
        # Extract metadata from the header block
        site = investigator = location = None
        header_row = None
        for i, row in raw.iterrows():
            first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
            if first.startswith("Site:"):
                site = first.replace("Site:", "").strip()
            elif first.startswith("Investigator:"):
                investigator = first.replace("Investigator:", "").strip()
            elif first.startswith("Location:"):
                location = first.replace("Location:", "").strip()
            # Data header row: the first column is "Medication" or "Medication ID"
            if first in ("Medication", "Medication ID") and header_row is None:
                header_row = i
        if header_row is None:
            continue
        df = pd.read_excel(path, header=header_row)
        df = df.dropna(how="all")
        # Normalize the first column name to "medication_id"
        df = df.rename(columns={df.columns[0]: "medication_id"})
        col = df.columns.tolist()
        for _, r in df.iterrows():
            rows.append({
                "site": site,
                "investigator": investigator,
                "location": location,
                "medication_id": to_str(r["medication_id"]),
                "packaged_lot_no": to_str(r.get("Packaged Lot number")),
                "original_expiration_date": to_date(r.get("Original Expiration Date when Packaged Lot was Added")),
                "expiration_date": to_date(r.get("Expiration date")),
                "received_date": to_date(r.get("Received Date")),
                "receipt_user": to_str(r.get("Shipment Receipt User")),
                "subject_identifier": to_str(r.get("Subject Identifier")),
                "quantity_assigned": to_int(r.get("Quantity Assigned")),
                "irt_transaction": to_str(r.get("IRT Transaction")),
                "date_assigned": to_date(r.get("Date Assigned")),
                "assignment_user": to_str(r.get("Assignment User")),
                "dispensation_status": to_str(r.get("Dispensation Status")),
                "dispensing_date": to_date(r.get("Dispensing date") or r.get("Dispensing Date")),
                "quantity_dispensed": to_int(r.get("Quantity Dispensed")),
                "dispensing_user": to_str(r.get("Dispensing User")),
                "quantity_returned": to_int(r.get("Quantity Returned")),
                "date_returned": to_date(r.get("Date Returned")),
                "return_user": to_str(r.get("Return User")),
            })
    return rows
def parse_destruction_files(study):
    dest_dir = os.path.join(BASE_DIR, f"xls_ip_destruction_{study}")
    files = sorted(glob.glob(os.path.join(dest_dir, "ip_destruction_basket_*.xlsx")))
    baskets = []
    for path in files:
        raw = pd.read_excel(path, header=None)
        # Metadata from the header block
        meta = {}
        header_row = None
        for i, row in raw.iterrows():
            first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
            for key, attr in [
                ("Investigator Name:", "investigator"),
                ("Site ID:", "site_id"),
                ("Location:", "location"),
                ("Basket ID:", "basket_id"),
                ("Drug Destruction Created Date:", "destruction_date"),
            ]:
                if first.startswith(key):
                    meta[attr] = first.replace(key, "").strip()
            if first == "Medication ID Description" and header_row is None:
                header_row = i
        if header_row is None:
            continue
        df = pd.read_excel(path, header=header_row)
        df = df.dropna(how="all")
        items = []
        for _, r in df.iterrows():
            items.append({
                "medication_description": to_str(r.get("Medication ID Description")),
                "medication_id": to_str(r.get("Medication ID")),
                "packaged_lot_description": to_str(r.get("Packaged Lot description")),
                "comments": to_str(r.get("Comments")),
            })
        baskets.append({
            "site_id": meta.get("site_id"),
            "investigator": meta.get("investigator"),
            "location": meta.get("location"),
            "basket_id": meta.get("basket_id"),
            "destruction_date": to_date(meta.get("destruction_date")),
            "items": items,
        })
    return baskets
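# All three parsers above scan the raw sheet for the row that carries the
# column headers. A minimal sketch of that scan as one shared helper follows.
# `find_header_row` and its parameters are hypothetical names, and the sample
# sheet is made up; with pandas, the rows could come from raw.values.tolist().

```python
def find_header_row(rows, markers, first_cell_only=False):
    """Return the index of the first row containing one of `markers`, else None."""
    for index, row in enumerate(rows):
        # Either inspect only the first cell or every cell in the row
        cells = row[:1] if first_cell_only else row
        if any(str(cell).strip() in markers for cell in cells):
            return index
    return None


# Made-up sheet: two metadata rows, then the header row, then one data row.
sheet = [
    ["Site: EXAMPLE-SITE", None],
    ["Investigator: Example Name", None],
    ["Medication ID", "Packaged Lot number"],
    ["1001", "LOT-A"],
]
print(find_header_row(sheet, {"Medication", "Medication ID"}, first_cell_only=True))
```

# Returning None when no marker is found keeps the existing
# "if header_row is None: continue" checks unchanged.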
# ── inserters ────────────────────────────────────────────────────────────────
def insert_shipments(cursor, import_id, study, rows):
    sql = """INSERT INTO iwrs_shipments
        (import_id, study, shipment_id, status, type, ship_from, ship_to_site,
        location, request_date, shipped_date, received_date, received_by,
        delivered_date_utc, delivery_recipient, delivery_details, cancelled_date,
        total_medication_ids, tracking_no, shipping_category, expected_arrival)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    for r in rows:
        cursor.execute(sql, (
            import_id, study, r["shipment_id"], r["status"], r["type"],
            r["ship_from"], r["ship_to_site"], r["location"],
            r["request_date"], r["shipped_date"], r["received_date"],
            r["received_by"], r["delivered_date_utc"], r["delivery_recipient"],
            r["delivery_details"], r["cancelled_date"], r["total_medication_ids"],
            r["tracking_no"], r["shipping_category"], r["expected_arrival"],
        ))
def insert_shipment_items(cursor, import_id, study, rows):
    sql = """INSERT INTO iwrs_shipment_items
        (import_id, study, shipment_id, destination_location, shipment_status,
        shipment_type, destination_site, investigator, medication_description,
        medication_type, medication_id, packaged_lot_no, packaged_lot_description,
        container_id, quantity, expiration_date, item_status)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    for r in rows:
        cursor.execute(sql, (
            import_id, study, r["shipment_id"], r["destination_location"],
            r["shipment_status"], r["shipment_type"], r["destination_site"],
            r["investigator"], r["medication_description"], r["medication_type"],
            r["medication_id"], r["packaged_lot_no"], r["packaged_lot_description"],
            r["container_id"], r["quantity"], r["expiration_date"], r["item_status"],
        ))
def insert_inventory(cursor, import_id, study, rows):
    sql = """INSERT INTO iwrs_inventory
        (import_id, study, site, investigator, location, medication_id,
        packaged_lot_no, original_expiration_date, expiration_date, received_date,
        receipt_user, subject_identifier, quantity_assigned, irt_transaction,
        date_assigned, assignment_user, dispensation_status, dispensing_date,
        quantity_dispensed, dispensing_user, quantity_returned, date_returned, return_user)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    for r in rows:
        cursor.execute(sql, (
            import_id, study, r["site"], r["investigator"], r["location"],
            r["medication_id"], r["packaged_lot_no"], r["original_expiration_date"],
            r["expiration_date"], r["received_date"], r["receipt_user"],
            r["subject_identifier"], r["quantity_assigned"], r["irt_transaction"],
            r["date_assigned"], r["assignment_user"], r["dispensation_status"],
            r["dispensing_date"], r["quantity_dispensed"], r["dispensing_user"],
            r["quantity_returned"], r["date_returned"], r["return_user"],
        ))
def insert_destruction(cursor, study, baskets):
    sql = """INSERT IGNORE INTO iwrs_destruction
        (study, site_id, investigator, location, basket_id, destruction_date,
        medication_description, medication_id, packaged_lot_description, comments)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    skipped = 0
    imported = 0
    for b in baskets:
        if basket_already_imported(cursor, study, b["basket_id"]):
            skipped += 1
            continue
        for item in b["items"]:
            cursor.execute(sql, (
                study, b["site_id"], b["investigator"], b["location"],
                b["basket_id"], b["destruction_date"],
                item["medication_description"], item["medication_id"],
                item["packaged_lot_description"], item["comments"],
            ))
            imported += 1
    return imported, skipped
# ── main ─────────────────────────────────────────────────────────────────────
def import_study(study):
    print(f"\n Parsing data for {study}...")
    shipments = parse_shipments_report(study)
    items = parse_shipment_details(study)
    inventory = parse_inventory(study)
    baskets = parse_destruction_files(study)
    print(f" Shipments: {len(shipments)} | Shipment items: {len(items)} | Inventory: {len(inventory)} | Destruction baskets: {len(baskets)}")
    conn = get_conn()
    cursor = conn.cursor()
    try:
        import_id = insert_import(cursor, study, f"drugs_{study}")
        print(f" import_id = {import_id}")
        insert_shipments(cursor, import_id, study, shipments)
        insert_shipment_items(cursor, import_id, study, items)
        insert_inventory(cursor, import_id, study, inventory)
        dest_imported, dest_skipped = insert_destruction(cursor, study, baskets)
        conn.commit()
    finally:
        # Always release the connection, even when an insert fails
        cursor.close()
        conn.close()
    print(f" Destruction: {dest_imported} new | {dest_skipped} baskets skipped (already imported)")
def main():
    for study in STUDIES:
        print(f"\n{'='*60}")
        print(f"[{study}]")
        print(f"{'='*60}")
        try:
            import_study(study)
            print(" OK")
        except Exception as e:
            import traceback
            print(f" ERROR: {e}")
            traceback.print_exc()
    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,85 @@
import sys
import os
from playwright.sync_api import sync_playwright
import download_reports
import download_ip_destruction
import download_shipments_report
import download_shipment_details
import create_accountability_report
BASE_URL = "https://janssen.4gclinical.com"
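# Credentials are read from the environment instead of being committed to the repository.
# The variable names IWRS_EMAIL and IWRS_PASSWORD are assumptions; adjust them to your setup.
EMAIL = os.environ["IWRS_EMAIL"]
PASSWORD = os.environ["IWRS_PASSWORD"]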
STUDIES = {
"1": "77242113UCO3001",
"2": "42847922MDD3003",
}
def pick_study():
    print("Select a study:")
    for k, v in STUDIES.items():
        print(f" {k}) {v}")
    while True:
        choice = input("Choice (1/2): ").strip()
        if choice in STUDIES:
            return STUDIES[choice]
        print(" Invalid choice, try again.")
def login_and_select_study(page, study):
    print(f"\n[1/6] Logging in and selecting study {study}...")
    page.goto(BASE_URL)
    page.wait_for_load_state("networkidle")
    page.get_by_label("Email *").fill(EMAIL)
    page.get_by_label("Password *").fill(PASSWORD)
    page.locator('#login__submit').click()
    page.wait_for_load_state("networkidle")
    page.get_by_label("Study *").click()
    page.get_by_role("option", name=study).click()
    page.get_by_role("button", name="SELECT").click()
    page.wait_for_load_state("networkidle")
    print(" OK")


def main():
    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    study = pick_study()
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        login_and_select_study(page, study)
        print("\n[2/6] Downloading inventory reports...")
        download_reports.run(page, study)
        print("\n[3/6] Downloading IP destruction reports...")
        download_ip_destruction.run(page, study)
        print("\n[4/6] Downloading shipments report...")
        download_shipments_report.run(page, study)
        print("\n[5/6] Downloading shipment details...")
        download_shipment_details.run(page, study)
        browser.close()
    print("\n[6/6] Generating accountability report...")
    # Point the report module at this study's download folders before running it
    from datetime import date
    from pathlib import Path
    report = create_accountability_report
    report.STUDY = study
    report.INVENTORY_DIR = Path(f"xls_reports_{study}")
    report.DESTRUCTION_DIR = Path(f"xls_ip_destruction_{study}")
    report.SHIPMENTS_FILE = Path(f"xls_shipments_{study}/shipments_report_{study}.xlsx")
    report.DETAILS_DIR = Path(f"xls_shipment_details_{study}")
    report.OUTPUT_FILE = report.OUTPUT_DIR / f"{date.today().strftime('%Y-%m-%d')} {study} CZ IWRS overview.xlsx"
    report.main()
    print("\nAll done!")


if __name__ == "__main__":
    main()
@@ -0,0 +1,5 @@
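import os

# Connection settings come from the environment so that no password is stored in the repository.
# The variable names (IWRS_DB_*) are assumptions; the defaults mirror the previous non-secret values.
DB_HOST = os.environ.get("IWRS_DB_HOST", "192.168.1.76")
DB_PORT = int(os.environ.get("IWRS_DB_PORT", "3306"))
DB_USER = os.environ.get("IWRS_DB_USER", "root")
DB_PASSWORD = os.environ["IWRS_DB_PASSWORD"]
DB_NAME = os.environ.get("IWRS_DB_NAME", "studie")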
@@ -0,0 +1,52 @@
import mysql.connector
import pandas as pd
import db_config
conn = mysql.connector.connect(
host=db_config.DB_HOST, port=db_config.DB_PORT,
user=db_config.DB_USER, password=db_config.DB_PASSWORD,
database=db_config.DB_NAME,
)
cursor = conn.cursor(dictionary=True)
# Take the latest import_id for each study
for study in ["77242113UCO3001", "42847922MDD3003"]:
    cursor.execute(
        "SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='patients'",
        (study,),
    )
    row = cursor.fetchone()
    mid = row["mid"]
    print(f"\n=== {study} (import_id={mid}) ===")
    cursor.execute("""
        SELECT
            v.subject,
            v.actual_date,
            v.scheduled_date,
            v.irt_transaction_no,
            v.irt_transaction_description,
            v.medication_assignment,
            GROUP_CONCAT(v.medication_id ORDER BY v.medication_id SEPARATOR ', ') AS medication_ids,
            SUM(v.quantity_assigned) AS quantity_assigned
        FROM iwrs_subject_visits v
        WHERE v.import_id = %s AND v.study = %s AND v.visit_type = 'Past'
            AND v.irt_transaction_no IS NOT NULL
        GROUP BY v.subject, v.actual_date, v.scheduled_date, v.irt_transaction_no,
            v.irt_transaction_description, v.medication_assignment
        ORDER BY v.subject, v.actual_date
        LIMIT 20
    """, (mid, study))
    rows = cursor.fetchall()
    df = pd.DataFrame(rows)
    if df.empty:
        print(" No data.")
    else:
        pd.set_option("display.max_columns", None)
        pd.set_option("display.width", 200)
        pd.set_option("display.max_colwidth", 30)
        print(df.to_string(index=False))
cursor.close()
conn.close()

Some files were not shown because too many files have changed in this diff.