z230
@@ -0,0 +1,209 @@
"Protocol","Country","Site","PI Name","Subject ID","Age at Informed Consent","Baseline Stool Count","Confirm Baseline Stool Count","Data Correction ID","Creation Date UTC","Status","Description","Date of Last Action UTC","Total Open Period","Total Open Time (Days)","Current Status Time (Days)","Type","Next Action Required","Category","Query History","Reason for Change","Resolution"
"77242113UCO3001","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","48","1","","SW00703544","13-May-2026","Submitted","Please change answer to clinical remision from no to YES (week 12). Entry erros ","20-May-2026","15-21 Days","19","14","Query Active ","Site","New","(1) 20 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification Request.
For us to process your request, please let us know the name of the form (with date) with question.
Thank you. ERT/CLARIO Data Coordination Team
","Entry Error",""
"77242113UCO3001","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","79","1","","SW00696586","09-Apr-2026","ReadyForQC","Please correct date of endoscopy to date: 18 March 2026 (from 25 March 2026)","15-Apr-2026","Over 28 Days","41","37","Query Active ","Site","Site-Entered Data","","Entry Error","CLARIO RESOLUTION:
Part 1: In Mayo Subscore (1) dated 08 Apr 2026 for I-0 visit, CLARIO to make the following changes:
- What was the date of endoscopy? (ENDODT1D): from 25 Mar 2026 to 18 Mar 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","19","1","","SW00704536","19-May-2026","ReadyForQC","Please change the endoscopy date to 19-FEB-2026. 06-MAR-2026 was entered in error. ","26-May-2026","15-21 Days","15","10","Query Active ","Site","Site-Entered Data","","Entry Error","CLARIO RESOLUTION:
Part 1: In Mayo Subscore (1) dated 20 Mar 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 06 Mar 2026 to 19 Feb 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","22","5","Yes, I confirm this is the correct stool count.","SW00706684","01-Jun-2026","Submitted","The right endoscopy date is 23MAR2026, please change the date","05-Jun-2026","4-7 Days","7","2","Query Active ","Site","New","(1) 05 Jun 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please confirm that if you are requesting following.
Mayo Subscore (1) dated 07 Apr 2026 for I-0
What was the date of endoscopy? (ENDODT1D): from 24 Mar 2026 to 23 Mar 2026
Thank you. ERT/CLARIO Data Coordination Team.
","Entry Error",""
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","29","1","","SW00705646","26-May-2026","ReadyForQC","Correct visit date I-O is 12-May-2026. All questionaries were filled on paper and entered in tablet later.
Log-in issue. ","09-Jun-2026","8-14 Days","10","1","","Clario DM","Visit Data","(1) 01 Jun 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please provide the timestamps for each of the assessments if you used paper forms and transcribed into the device.
If unknown, ERT will use a dummy timestamp.
Thank you. ERT/CLARIO Data Coordination Team.
(2) 01 Jun 2026 dstepek@vnbrno.cz (Site User): time is unknown
","Changed Information","CLARIO RESOLUTION:
Part 1: In the following forms for I-0, CLARIO to make the following changes:
-Report Date: from 26May 2026 to 12 May 2026
-Report Start Date and time: from 26 May 2026 to 12 May 2026 23:59:59
-Event End Date: from 26 May 2026 08:27:57 to 12 May 2026 23:59:59
+Tablet Training Module (1)
+Participant Start Instructions (1)
+IBDQ (1)
+PROMIS Fatigue – Short Form 7a (1)
+BASDAI (1)
+Participant End Instructions (1)
+Visit End (122)
"
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","49","1","","SW00708623","10-Jun-2026","Submitted","Correct date of I-2 is 26.5.2026. all questionaries were entered on paper at 07,45 and transmited later. ","10-Jun-2026","1 Day","1","","","Clario DM","New","","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","49","1","","SW00706581","29-May-2026","ReadyForQC","baseline stool count reported by subject is 0, please change to 1 as per CRA request (subject has 1 stool in 2-3 days if in remission)","05-Jun-2026","4-7 Days","7","3","","Clario DM","Demographic","","Changed Information","CLARIO RESOLUTION:
Part 1: In System Variables form, CLARIO to make the following changes:
- Baseline Stool Count (PT.Custom4): from 0 to 1
"
"77242113UCO3001","Czech Republic","DD5-CZ10016","Robert Mudr","CZ100162001","48","1","","SW00705916","27-May-2026","ReadyForQC","As per ATS investigation (ATS26040111), please remove the below form which was entered as a duplicate
- MAYO Diary (5) 24 Apr 2026","05-Jun-2026","8-14 Days","9","3","","Clario DM","Technical Revision","","Technical Revision - Other","CLARIO RESOLUTION:
Part 1: CLARIO to delete MAYO Diary (5) dated 24 Apr 2026
"
"77242113UCO3001","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","15","1","","SW00701729","06-May-2026","Completed","Dears, please delete data from visit I-0 (reported as 4th of May 2026) as this visit had to be postponed - see the previous DCR of this patient and change data request that was corrected. Patient has left the site before it was resolved and and new date of I-0 was planned. Patient continues to fill in his diary and patient is coming to I=0 visit within allowed window. We need the system and tablet to be ready to run new Mayo Score Report with updated and recent data (e.g. reflect new I-0 visit date, new eligible days -1 to -7.).
thank you, Jiri Skopek","19-May-2026","8-14 Days","8","","","","Visit Data","(1) 11 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please note that the delete forms are allowed if the reason is one of the following.
If not, forms will move to unscheduled visit.
Data collected by the wrong patient.
Data collected by someone other than the patient.
Data collected prior to informed consent, or after withdrawal from the study.
Duplicate data erroneously entered at an Unscheduled visit via paper transcription.
Data collected that is not expected per protocol.
Also, I-0 visit is still ongoing. Please close the visit.
Once the visit was closed, we will process accoridngly.
Thank you. ERT/CLARIO Data Coordination Team
(2) 11 May 2026 jskopek (Site User): Dears,
I do not see any option that is adequate -from the list. Data are not needed to be deleted fully, they reflect the situation at May4th. Please mark it as unscheduled visit - as exactly that is the case. We need the system to be ready for I-0 visit planned for next week.
I will close the visit tomorrow - do you mean in tablet/ipad?
Thank you very much for your help! Jiri
(3) 12 May 2026 venkata.ramana (Clario): Thank you for your response.
Please note that the visit I-0 was still ongoing but not closed yet.
So please close the visit.
Kind Regards, Clario Data Coordination Team.
(4) 12 May 2026 jskopek (Site User): If I try to close the I-O visit in TABLET, it asks me if patient fulfils eligibility criteria to proceed to next visit based on these old data – if I answer NO, it asks me to DEACTIVATE patient. I do not want to DEACTIVATE patient – can you help WHERE and HOW to close this visit for you to change it to UNSCHEDULED and not to de-activate patient?
Thank you Jiri
","Other-delete visit I-0","CLARIO RESOLUTION:
Part 1: In the following forms dated 04 May 2026, CLARIO to make the following changes:
-Event ID: from I-0 to Unscheduled Visit 1
-Event At Entry: from I-0 to Unscheduled Visit 1
+Visit Start (49)
+ePRO Availability (1)
+Mayo Subscore (1)
+PGA (1)
Part 2: CLARIO to delete the following forms dated 04 May 2026 for I-0 visit.
+C-SSRS Since Last Visit (1)
+C-SSRS Since Last Visit Findings Report (1)
Part 3: CLARIO to manually enter Visit End form for Unscheduled visit 1 with the following information:
-Protocol: 77242113UCO3001
-Report Date: 04 May 2026
-Report Start Date and Time: 04 May 2026 23:59:59
-Event ID: Unscheduled Visit 1
-Event End Date: 04 May 2026 23:59:59
-Visit Status: Incomplete
-Phase At Entry: Screening
-Phase At Entry Timestamp: 13 Apr 2026 12:32:20
-Event At Entry: Unscheduled visit 1
-Event Start Date: 04 May 2026 23:59:59
-Event Time Zone Offset in Milliseconds: 7200000
-Session Repeat Number (SESREP1N): 0
-Session Instance Id (SESINST1S): 3f1214f0-4788-11f1-a0cf-bb403212adce
"
"77242113UCO3001","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","15","1","","SW00701226","04-May-2026","Completed","Dears, we would like ask you to change the information I read on assignment form given by patient on April 13, 2026 (Visit 1), Baseline Stool Count (PT.Custom4) as 3 that should be reported as 1.
Patient has entered wrong number as he did not understood it should be number of stools when illness is in remission or absent. He is a child and did not reflected this question correctly. Therefore, please change Baseline Stool Count = 1.
Thank you, Jiri Skopek ","04-May-2026","1 Day","1","","","","Demographic","","Changed Information","(Clario instructions)
1. Please make below changes in the assignment form:
Baseline Stool Count (PT. Custom4): 03 to 01."
"77242113UCO3001","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","61","1","","SW00699492","23-Apr-2026","ReadyForQC","Please correct the date of endoscopy done during screening visit of patient CZ100212001 to correct date 16-MAR-2026.","29-Apr-2026","Over 28 Days","32","28","Query Active ","Site","Site-Entered Data","","Changed Information","CLARIO RESOLUTION:
Part 1: In the Mayo Subscore (1) dated 07 Apr 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 24 Mar 2026 to 16 Mar 2026
- Data Flag (QSDFLG1B): from blank to check
"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","39","1","","SW00703322","12-May-2026","Completed","As per ATS investigation (ATS26040111), please remove the below form that's been entered as a duplicate
- MAYO Diary (16) - 18 Mar 2026
","20-May-2026","4-7 Days","6","","","","Technical Revision","","Technical Revision - Other","CLARIO RESOLUTION:
Part 1: CLARIO to delete the MAYO Diary (16) dated 18 Mar 2026.
"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","39","1","","SW00689748","09-Mar-2026","Completed","Dear all,
Patient CZ 100222003 was randomized on 9 Mar 2026. Kindly correct the colonoscopy date to 11 Feb 2025.
The date was initially entered as 21 Feb 2025 because the earlier date could not be entered in the system. The patient was rescreened.","02-Apr-2026","15-21 Days","17","","","","Site-Entered Data","(1) 13 Mar 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Could you please conform that if you are requesting following?
Mayo Subscore (1) dated 09 Mar 2026 for I-0 visit
-What was the date of endoscopy? (ENDODT1D): from 23 Feb 2026 to 11 Feb 2025
Could you please confirm the year? This subject was assigned on 02 Mar 2026, you are providing that correct date is 11 Feb 2025 which a year ago.
If you are not requesting above, please provide us the name of the form with question.
Thank you. ERT/CLARIO Data Coordination Team
(2) 13 Mar 2026 katerina.havlikova@clinoxus.com (Site User): confirm date of colonoscopy 11Feb2026
(3) 21 Mar 2026 msullivan (Clario): Dear Site,
The requested changes to the Mayo data have been updated. Please navigate to the Mayo Score Report and resubmit the form for visit to log the updated Mayo Score form. Once done, please respond to this query confirming that the Mayo Score has been resubmitted.
Thank you. ERT/CLARIO Data Coordination Team
(4) 24 Mar 2026 jana.pomahacova@clinoxus.com (Site User): Thank you and sent
","New Information","CLARIO RESOLUTION:
Part 1: In the Mayo Subscore (1) dated 09 Mar 2026 for I-0 visit, CLARIO to make the following changes:
-What was the date of endoscopy? (ENDODT1D): from 23 Feb 2026 to 11 Feb 2025
-Data Flag (QSDFLG1B): from blank to check"
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","33","1","","SW00705372","22-May-2026","Submitted","Dear all, please change Colonoscopz date from 8April2026 to date 01Apr2026 Thank you in advance","02-Jun-2026","8-14 Days","12","5","","Clario DM","New","(1) 29 May 2026 msullivan (Clario): Please confirm your request
Dear Site. Thank you for submitting this Data Clarification.
Please provide us the name of the form for this request.
Thank you. ERT/CLARIO Data Coordination Team
(2) 02 Jun 2026 katerina.havlikova@clinoxus.com (Site User): Dear all, please change Colonoscopy for Week I-12 date from 8April2026 to date 01Apr2026 Thank you in advance
","Changed Information",""
"77242113UCO3001","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","33","1","","SW00702538","08-May-2026","Completed","This TRR is to document the correction to the Mayo Subscore (1) form, where the following variables were populated with NULL values, due to a known core defect:
Event At Entry, Event Start Date, Event Time Zone Offset in Milliseconds.","12-May-2026","2-3 Days","2","","","","Technical Revision","","Technical Revision - Other","Please make the below changes in Mayo Subscore (1) dated 22 Apr 2026:
-Event At Entry: I-0
-Event Start Date: 09 Apr 2026 08:09:19
-Event Time Zone Offset in Milliseconds: 7200000"
+1328
File diff suppressed because it is too large
@@ -0,0 +1,53 @@
"Protocol","Study Population","Country","Site","Principal Investigator","Participant ID","Baseline Stool Frequency","Visit","Visit Date","Endoscopy Completed?","Endoscopy Date","Bowel Preparation Start Date 1","Bowel Preparation End Date 1","Bowel Preparation Start Date 2","Bowel Preparation End Date 2","Central Endoscopy Score","Local Endoscopy Score","PGA Score","Eligible Day (-1)","Day (-1) Excluded Reason(s)","Eligible Day (-2)","Day (-2) Excluded Reason(s)","Eligible Day (-3)","Day (-3) Excluded Reason(s)","Eligible Day (-4)","Day (-4) Excluded Reason(s)","Eligible Day (-5)","Day (-5) Excluded Reason(s)","Eligible Day (-6)","Day (-6) Excluded Reason(s)","Eligible Day (-7)","Day (-7) Excluded Reason(s)","Eligible Day (-8)","Day (-8) Excluded Reason(s)","Eligible Day (-9)","Day (-9) Excluded Reason(s)","Eligible Day (-10)","Day (-10) Excluded Reason(s)","Eligible Day (-1) Stool Count","Eligible Day (-2) Stool Count","Eligible Day (-3) Stool Count","Eligible Day (-4) Stool Count","Eligible Day (-5) Stool Count","Eligible Day (-6) Stool Count","Eligible Day (-7) Stool Count","Eligible Day (-8) Stool Count","Eligible Day (-9) Stool Count","Eligible Day (-10) Stool Count","Stool Frequency Sub-score","Eligible Day (-1) Rectal Bleeding Score","Eligible Day (-2) Rectal Bleeding Score","Eligible Day (-3) Rectal Bleeding Score","Eligible Day (-4) Rectal Bleeding Score","Eligible Day (-5) Rectal Bleeding Score","Eligible Day (-6) Rectal Bleeding Score","Eligible Day (-7) Rectal Bleeding Score","Eligible Day (-8) Rectal Bleeding Score","Eligible Day (-9) Rectal Bleeding Score","Eligible Day (-10) Rectal Bleeding Score","Rectal Bleeding Sub-score","Partial Mayo Score","Modified Mayo Score","Full Mayo Score","Site Action","Last Mayo Score Submission","Week I-12 Clinical Responder","Week I-12 Clinical Remission","Clinical Flare","Loss of Response","Partial Mayo Response Post Loss of Response","Partial Mayo Response for Clinical Non-Responders"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-0","19 Feb 2026","Yes","05 Feb 2026","04 Feb 2026","04 Feb 2026","-","-","2","-","3","18 Feb 2026","-","17 Feb 2026","-","16 Feb 2026","-","15 Feb 2026","-","14 Feb 2026","-","13 Feb 2026","-","12 Feb 2026","-","11 Feb 2026","Day Not Applicable for Calculation","10 Feb 2026","Day Not Applicable for Calculation","09 Feb 2026","Day Not Applicable for Calculation","10","8","7","5","7","8","8","-","-","-","3","1","1","1","0","1","1","1","-","-","-","1","7","6","9","-","08 Apr 2026 07:11:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-2","04 Mar 2026","-","-","-","-","-","-","-","-","3","03 Mar 2026","-","02 Mar 2026","-","01 Mar 2026","-","28 Feb 2026","-","27 Feb 2026","-","26 Feb 2026","-","25 Feb 2026","-","24 Feb 2026","Day Not Applicable for Calculation","23 Feb 2026","Day Not Applicable for Calculation","22 Feb 2026","Day Not Applicable for Calculation","5","4","5","4","5","6","6","-","-","-","2","1","0","1","0","1","0","1","-","-","-","1","6","","","-","28 May 2026 10:04:05","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-4","18 Mar 2026","-","-","-","-","-","-","-","-","2","17 Mar 2026","-","16 Mar 2026","-","15 Mar 2026","-","14 Mar 2026","-","13 Mar 2026","-","12 Mar 2026","-","11 Mar 2026","-","10 Mar 2026","Day Not Applicable for Calculation","09 Mar 2026","Day Not Applicable for Calculation","08 Mar 2026","Day Not Applicable for Calculation","5","5","5","4","5","4","5","-","-","-","2","1","0","0","1","1","1","0","-","-","-","1","5","","","-","08 Apr 2026 11:04:49","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-8","05 May 2026","-","-","-","-","-","-","-","-","1","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","3","3","4","4","5","4","4","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","4","","","-","28 May 2026 14:42:53","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","I-12","13 May 2026","Yes","06 May 2026","05 May 2026","05 May 2026","-","-","1","-","1","12 May 2026","-","11 May 2026","-","10 May 2026","-","09 May 2026","-","08 May 2026","-","07 May 2026","-","06 May 2026","Endoscopy","05 May 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","04 May 2026","-","03 May 2026","Day Not Applicable for Calculation","5","4","6","5","5","5","-","-","3","-","2","1","0","1","1","1","1","-","-","1","-","1","4","4","5","-","10 Jun 2026 07:16:05","Clinical Responder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012001","1","M-4","10 Jun 2026","-","-","-","-","-","-","-","-","1","09 Jun 2026","-","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","4","5","3","4","5","4","5","-","-","-","2","0","0","0","0","1","0","1","-","-","-","0","3","","","-","10 Jun 2026 07:15:50","N/A","N/A","No","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-0","08 Apr 2026","Yes","18 Mar 2026","17 Mar 2026","18 Mar 2026","-","-","2","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","Missing Diary","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","3","3","4","-","3","3","4","-","-","-","1","0","0","0","-","0","0","1","-","-","-","0","3","3","5","-","10 Jun 2026 08:42:08","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-2","23 Apr 2026","-","-","-","-","-","-","-","-","2","22 Apr 2026","Missing Diary","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","Day Not Applicable for Calculation","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","-","3","3","6","5","5","4","-","-","-","2","-","0","0","1","1","1","1","-","-","-","1","5","","","-","10 Jun 2026 08:42:33","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-4","06 May 2026","-","-","-","-","-","-","-","-","1","05 May 2026","-","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","Day Not Applicable for Calculation","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","6","3","2","3","3","3","3","-","-","-","1","1","0","0","0","1","1","0","-","-","-","0","2","","","-","28 May 2026 14:43:38","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012002","1","I-8","04 Jun 2026","-","-","-","-","-","-","-","-","1","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","Day Not Applicable for Calculation","26 May 2026","Day Not Applicable for Calculation","25 May 2026","Day Not Applicable for Calculation","3","4","3","3","3","3","4","-","-","-","1","0","0","0","0","0","0","1","-","-","-","0","2","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012003","1","I-0","27 May 2026","Yes","13 May 2026","12 May 2026","12 May 2026","-","-","3","-","2","26 May 2026","-","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","Day Not Applicable for Calculation","18 May 2026","Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","6","9","7","8","9","7","8","-","-","-","3","2","2","2","2","1","1","1","-","-","-","2","7","8","10","-","27 May 2026 07:24:39","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10001","Matej Falc","CZ100012003","1","I-2","10 Jun 2026","-","-","-","-","-","-","-","-","2","09 Jun 2026","-","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","7","8","8","7","6","8","6","-","-","-","3","2","2","1","2","2","2","1","-","-","-","2","7","","","-","10 Jun 2026 07:30:18","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10003","Leksa Vaclav","CZ100032001","2","I-0","10 Jun 2026","Yes","27 May 2026","26 May 2026","26 May 2026","-","-","2","-","2","09 Jun 2026","Missing Diary","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","-","4","4","4","5","4","5","-","-","-","1","-","2","2","2","2","2","2","-","-","-","2","5","5","7","-","10 Jun 2026 08:48:09","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-0","20 Mar 2026","Yes","19 Feb 2026","-","-","-","-","3","-","3","19 Mar 2026","-","18 Mar 2026","-","17 Mar 2026","-","16 Mar 2026","-","15 Mar 2026","-","14 Mar 2026","-","13 Mar 2026","-","12 Mar 2026","Day Not Applicable for Calculation","11 Mar 2026","Day Not Applicable for Calculation","10 Mar 2026","Day Not Applicable for Calculation","7","7","8","8","7","8","5","-","-","-","3","2","1","1","1","1","1","0","-","-","-","1","7","7","10","-","20 Mar 2026 07:03:23","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-2","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","Medication For Diarrhea","06 Apr 2026","Medication For Diarrhea","05 Apr 2026","Medication For Diarrhea","04 Apr 2026","Medication For Diarrhea","03 Apr 2026","Medication For Diarrhea","02 Apr 2026","Medication For Diarrhea","01 Apr 2026","Medication For Diarrhea","31 Mar 2026","Medication For Diarrhea;Day Not Applicable for Calculation","30 Mar 2026","Medication For Diarrhea;Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","-","-","-","-","-","-","-","-","-","-","Non-Evaluable","-","-","-","-","-","-","-","-","-","-","Non-Evaluable","Non-Evaluable","Non-Evaluable","Non-Evaluable","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-4","15 Apr 2026","-","-","-","-","-","-","-","-","3","14 Apr 2026","-","13 Apr 2026","-","12 Apr 2026","-","11 Apr 2026","-","10 Apr 2026","-","09 Apr 2026","-","08 Apr 2026","-","07 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","06 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","05 Apr 2026","Medication For Diarrhea;Day Not Applicable for Calculation","9","22","20","19","17","18","18","-","-","-","3","1","3","2","2","2","2","2","-","-","-","2","8","","","-","04 May 2026 22:06:03","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-8","18 May 2026","-","-","-","-","-","-","-","-","2","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","-","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","08 May 2026","Day Not Applicable for Calculation","7","5","9","7","7","8","8","-","-","-","3","1","1","1","1","1","1","1","-","-","-","1","6","","","-","04 Jun 2026 21:46:30","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062001","1","I-12","08 Jun 2026","Yes","28 May 2026","-","-","-","-","3","-","3","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","Missing Diary","31 May 2026","Day Not Applicable for Calculation","30 May 2026","Day Not Applicable for Calculation","29 May 2026","Day Not Applicable for Calculation","6","5","5","5","7","6","-","-","-","-","3","1","1","0","0","1","0","-","-","-","-","1","7","7","10","-","-","Clinical Nonresponder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062002","1","I-0","26 May 2026","Yes","14 May 2026","13 May 2026","13 May 2026","-","-","2","-","2","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","-","18 May 2026","Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","16 May 2026","Day Not Applicable for Calculation","8","8","6","7","7","6","7","-","-","-","3","2","2","2","2","2","2","2","-","-","-","2","7","7","9","-","29 May 2026 15:45:00","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10006","Michal Konecny","CZ100062002","1","I-2","09 Jun 2026","-","-","-","-","-","-","-","-","2","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","30 May 2026","Day Not Applicable for Calculation","7","8","7","7","7","5","7","-","-","-","3","2","1","1","1","2","2","2","-","-","-","2","7","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-0","05 May 2026","Yes","24 Apr 2026","23 Apr 2026","23 Apr 2026","-","-","2","-","2","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","5","5","5","5","5","5","5","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","5","5","7","-","05 May 2026 11:19:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-2","19 May 2026","-","-","-","-","-","-","-","-","1","18 May 2026","-","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","Day Not Applicable for Calculation","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","5","4","5","5","5","4","6","-","-","-","2","1","1","1","1","1","1","1","-","-","-","1","4","","","-","19 May 2026 10:38:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10009","Jiri Pumprla","CZ100092001","1","I-4","04 Jun 2026","-","-","-","-","-","-","-","-","1","03 Jun 2026","-","02 Jun 2026","-","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","Day Not Applicable for Calculation","26 May 2026","Day Not Applicable for Calculation","25 May 2026","Day Not Applicable for Calculation","2","3","2","3","3","2","3","-","-","-","1","0","0","0","0","0","0","0","-","-","-","0","2","","","-","04 Jun 2026 09:24:54","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-0","07 Apr 2026","Yes","24 Mar 2026","22 Mar 2026","22 Mar 2026","-","-","2","-","2","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","-","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","28 Mar 2026","Day Not Applicable for Calculation","8","11","5","9","11","10","13","-","-","-","3","1","2","2","2","2","2","2","-","-","-","2","7","7","9","-","04 May 2026 08:44:52","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-2","22 Apr 2026","-","-","-","-","-","-","-","-","2","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","7","5","6","6","7","8","2","-","-","-","1","1","0","1","1","1","2","0","-","-","-","1","4","","","-","04 May 2026 08:45:07","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-4","07 May 2026","-","-","-","-","-","-","-","-","1","06 May 2026","-","05 May 2026","-","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","Day Not Applicable for Calculation","28 Apr 2026","Day Not Applicable for Calculation","27 Apr 2026","Day Not Applicable for Calculation","8","7","7","8","4","11","7","-","-","-","1","2","1","1","1","0","1","1","-","-","-","1","3","","","-","01 Jun 2026 00:57:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10012","Stefan Konecny","CZ100122001","5","I-8","03 Jun 2026","-","-","-","-","-","-","-","-","2","02 Jun 2026","-","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","Day Not Applicable for Calculation","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","5","9","7","5","5","9","7","-","-","-","1","1","1","1","0","3","0","1","-","-","-","1","4","","","-","03 Jun 2026 17:47:25","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-0","24 Mar 2026","Yes","12 Mar 2026","11 Mar 2026","11 Mar 2026","-","-","2","-","2","23 Mar 2026","-","22 Mar 2026","-","21 Mar 2026","-","20 Mar 2026","-","19 Mar 2026","-","18 Mar 2026","-","17 Mar 2026","-","16 Mar 2026","Day Not Applicable for Calculation","15 Mar 2026","Day Not Applicable for Calculation","14 Mar 2026","Day Not Applicable for Calculation","8","6","5","7","6","7","6","-","-","-","3","1","1","1","0","1","1","1","-","-","-","1","6","6","8","-","05 Apr 2026 22:41:27","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-2","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","5","2","3","6","5","5","5","-","-","-","2","0","0","0","0","1","1","0","-","-","-","0","4","","","-","28 May 2026 23:19:03","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132001","1","I-4","21 Apr 2026","-","-","-","-","-","-","-","-","0","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","-","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","11 Apr 2026","Day Not Applicable for Calculation","4","3","4","3","3","4","4","-","-","-","2","0","0","0","0","0","0","0","-","-","-","0","2","","","-","27 May 2026 12:54:41","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","1","I-0","12 May 2026","Yes","21 Apr 2026","20 Apr 2026","21 Apr 2026","-","-","2","-","2","11 May 2026","-","10 May 2026","-","09 May 2026","-","08 May 2026","-","07 May 2026","-","06 May 2026","-","05 May 2026","Missing Diary","04 May 2026","Day Not Applicable for Calculation","03 May 2026","Day Not Applicable for Calculation","02 May 2026","Day Not Applicable for Calculation","2","1","1","1","1","2","-","-","-","-","0","0","0","0","0","0","0","-","-","-","-","0","2","2","4","-","28 May 2026 23:19:30","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132002","1","I-2","26 May 2026","-","-","-","-","-","-","-","-","1","25 May 2026","-","24 May 2026","Missing Diary","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","-","19 May 2026","-","18 May 2026","Missing Diary;Day Not Applicable for Calculation","17 May 2026","Day Not Applicable for Calculation","16 May 2026","Day Not Applicable for Calculation","1","-","1","2","1","2","2","-","-","-","1","0","-","0","0","0","0","0","-","-","-","0","2","","","-","28 May 2026 23:19:51","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","1","I-0","02 Jun 2026","Yes","25 May 2026","24 May 2026","24 May 2026","-","-","2","-","2","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Endoscopy;Missing Diary;Day Not Applicable for Calculation","24 May 2026","Bowel Preparation for Procedure;Missing Diary;Day Not Applicable for Calculation","23 May 2026","Missing Diary;Day Not Applicable for Calculation","8","8","11","10","10","11","6","-","-","-","3","2","2","1","2","1","2","2","-","-","-","2","7","7","9","-","02 Jun 2026 08:17:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10013","David Stepek","CZ100132003","1","I-2","10 Jun 2026","-","-","-","-","-","-","-","-","2","09 Jun 2026","-","08 Jun 2026","-","07 Jun 2026","-","06 Jun 2026","-","05 Jun 2026","-","04 Jun 2026","-","03 Jun 2026","-","02 Jun 2026","Day Not Applicable for Calculation","01 Jun 2026","Day Not Applicable for Calculation","31 May 2026","Day Not Applicable for Calculation","9","2","1","4","2","4","2","-","-","-","1","1","1","0","1","1","1","0","-","-","-","1","4","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10016","Robert Mudr","CZ100162001","1","I-0","28 May 2026","Yes","19 May 2026","18 May 2026","19 May 2026","-","-","3","-","3","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","-","23 May 2026","-","22 May 2026","-","21 May 2026","-","20 May 2026","Day Not Applicable for Calculation","19 May 2026","Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation","18 May 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","14","15","15","15","15","15","15","-","-","-","3","2","3","3","2","2","3","3","-","-","-","3","9","9","12","-","28 May 2026 10:19:28","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","Unscheduled 1","04 May 2026","Yes","20 Apr 2026","12 Apr 2026","15 Apr 2026","-","-","2","-","3","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","-","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","24 Apr 2026","Day Not Applicable for Calculation","5","6","6","7","6","3","3","-","-","-","2","0","0","0","0","0","0","0","-","-","-","0","5","4","7","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","I-0","18 May 2026","Yes","01 May 2026","01 May 2026","01 May 2026","-","-","2","-","3","17 May 2026","-","16 May 2026","-","15 May 2026","-","14 May 2026","-","13 May 2026","-","12 May 2026","-","11 May 2026","-","10 May 2026","Day Not Applicable for Calculation","09 May 2026","Day Not Applicable for Calculation","08 May 2026","Day Not Applicable for Calculation","6","6","6","6","6","6","6","-","-","-","3","0","0","0","0","0","0","0","-","-","-","0","6","5","8","-","18 May 2026 08:39:27","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adolescent","Czech Republic","DD5-CZ10020","Lucie Gonsorcikova","CZ100201001","1","I-2","01 Jun 2026","-","-","-","-","-","-","-","-","3","31 May 2026","-","30 May 2026","Missing Diary","29 May 2026","Missing Diary","28 May 2026","Missing Diary","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","22 May 2026","Day Not Applicable for Calculation","6","-","-","-","6","6","6","-","-","-","3","0","-","-","-","0","0","0","-","-","-","0","6","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-0","07 Apr 2026","Yes","16 Mar 2026","15 Mar 2026","16 Mar 2026","-","-","3","-","3","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","-","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","28 Mar 2026","Day Not Applicable for Calculation","11","11","10","11","11","10","9","-","-","-","3","2","2","2","2","2","2","2","-","-","-","2","8","8","11","-","20 Apr 2026 09:27:58","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-2","20 Apr 2026","-","-","-","-","-","-","-","-","3","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","-","13 Apr 2026","-","12 Apr 2026","Day Not Applicable for Calculation","11 Apr 2026","Day Not Applicable for Calculation","10 Apr 2026","Day Not Applicable for Calculation","8","7","9","8","8","7","8","-","-","-","3","2","2","1","1","1","2","1","-","-","-","1","7","","","-","20 Apr 2026 09:29:01","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-4","05 May 2026","-","-","-","-","-","-","-","-","1","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","6","6","6","6","7","7","6","-","-","-","3","0","0","1","1","1","1","1","-","-","-","1","5","","","-","-","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10021","Martin Bortlik","CZ100212001","1","I-8","02 Jun 2026","-","-","-","-","-","-","-","-","1","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","3","4","4","4","5","5","5","-","-","-","2","0","0","0","0","0","1","1","-","-","-","0","3","","","-","02 Jun 2026 14:44:34","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222002","1","I-0","19 Feb 2026","Yes","11 Feb 2026","10 Feb 2026","11 Feb 2026","-","-","2","-","2","18 Feb 2026","-","17 Feb 2026","-","16 Feb 2026","-","15 Feb 2026","-","14 Feb 2026","-","13 Feb 2026","-","12 Feb 2026","-","11 Feb 2026","Endoscopy;Bowel Preparation for Procedure;Day Not Applicable for Calculation","10 Feb 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","09 Feb 2026","Day Not Applicable for Calculation","3","2","2","3","4","3","2","-","-","-","1","1","1","0","0","0","2","2","-","-","-","1","4","4","6","-","19 Feb 2026 15:37:49","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-0","09 Mar 2026","Yes","11 Feb 2026","10 Feb 2026","11 Feb 2026","-","-","2","-","2","08 Mar 2026","-","07 Mar 2026","-","06 Mar 2026","-","05 Mar 2026","-","04 Mar 2026","-","03 Mar 2026","Missing Diary","02 Mar 2026","Missing Diary","01 Mar 2026","Missing Diary;Day Not Applicable for Calculation","28 Feb 2026","Missing Diary;Day Not Applicable for Calculation","27 Feb 2026","Missing Diary;Day Not Applicable for Calculation","7","7","6","6","7","-","-","-","-","-","3","2","2","2","2","2","-","-","-","-","-","2","7","7","9","-","24 Mar 2026 14:23:10","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-2","27 Mar 2026","-","-","-","-","-","-","-","-","2","26 Mar 2026","-","25 Mar 2026","-","24 Mar 2026","-","23 Mar 2026","-","22 Mar 2026","-","21 Mar 2026","-","20 Mar 2026","-","19 Mar 2026","Day Not Applicable for Calculation","18 Mar 2026","Day Not Applicable for Calculation","17 Mar 2026","Day Not Applicable for Calculation","7","3","3","3","5","5","5","-","-","-","2","0","0","1","1","1","1","2","-","-","-","1","5","","","-","08 Apr 2026 07:36:56","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-4","08 Apr 2026","-","-","-","-","-","-","-","-","2","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","-","31 Mar 2026","Day Not Applicable for Calculation","30 Mar 2026","Day Not Applicable for Calculation","29 Mar 2026","Day Not Applicable for Calculation","3","3","4","4","5","4","3","-","-","-","2","1","0","0","2","1","1","2","-","-","-","1","5","","","-","08 Apr 2026 07:59:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-8","04 May 2026","-","-","-","-","-","-","-","-","2","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","-","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","24 Apr 2026","Missing Diary;Day Not Applicable for Calculation","3","5","3","3","3","2","3","-","-","-","1","0","0","0","0","0","0","0","-","-","-","0","3","","","-","04 May 2026 08:08:40","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222003","1","I-12","01 Jun 2026","Yes","20 May 2026","19 May 2026","20 May 2026","-","-","3","-","2","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","-","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","22 May 2026","Day Not Applicable for Calculation","4","4","6","3","3","3","3","-","-","-","2","1","1","2","1","1","1","2","-","-","-","1","5","6","8","-","01 Jun 2026 14:25:57","Clinical Nonresponder","No","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-0","09 Apr 2026","Yes","08 Apr 2026","31 Mar 2026","01 Apr 2026","-","-","2","-","2","08 Apr 2026","Endoscopy","07 Apr 2026","-","06 Apr 2026","-","05 Apr 2026","-","04 Apr 2026","-","03 Apr 2026","-","02 Apr 2026","-","01 Apr 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","31 Mar 2026","Bowel Preparation for Procedure;Day Not Applicable for Calculation","30 Mar 2026","-","-","3","3","4","3","4","3","-","-","3","1","-","2","2","2","2","2","2","-","-","2","2","5","5","7","-","29 May 2026 11:07:08","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-2","22 Apr 2026","-","-","-","-","-","-","-","-","2","21 Apr 2026","-","20 Apr 2026","-","19 Apr 2026","-","18 Apr 2026","-","17 Apr 2026","-","16 Apr 2026","-","15 Apr 2026","-","14 Apr 2026","Day Not Applicable for Calculation","13 Apr 2026","Day Not Applicable for Calculation","12 Apr 2026","Day Not Applicable for Calculation","3","3","5","3","2","3","2","-","-","-","1","1","2","2","1","1","1","2","-","-","-","1","4","","","-","05 May 2026 07:29:35","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-4","05 May 2026","-","-","-","-","-","-","-","-","2","04 May 2026","-","03 May 2026","-","02 May 2026","-","01 May 2026","-","30 Apr 2026","-","29 Apr 2026","-","28 Apr 2026","-","27 Apr 2026","Day Not Applicable for Calculation","26 Apr 2026","Day Not Applicable for Calculation","25 Apr 2026","Day Not Applicable for Calculation","4","2","2","2","2","2","2","-","-","-","1","1","1","1","1","2","1","1","-","-","-","1","4","","","-","05 May 2026 07:28:55","N/A","N/A","N/A","N/A","N/A","N/A"
"77242113UCO3001","Adult","Czech Republic","DD5-CZ10022","Petr Hrabak","CZ100222005","1","I-8","02 Jun 2026","-","-","-","-","-","-","-","-","2","01 Jun 2026","-","31 May 2026","-","30 May 2026","-","29 May 2026","-","28 May 2026","-","27 May 2026","-","26 May 2026","-","25 May 2026","Day Not Applicable for Calculation","24 May 2026","Day Not Applicable for Calculation","23 May 2026","Day Not Applicable for Calculation","2","2","2","2","2","4","10","-","-","-","1","2","1","2","1","2","2","2","-","-","-","2","5","","","-","02 Jun 2026 08:18:08","N/A","N/A","N/A","N/A","N/A","N/A"
@@ -0,0 +1,83 @@
# jnj_tower_ingest v1.0.0

**File:** `jnj_tower_ingest_v1.0.py`
**Date:** 2026-06-10
**Author:** vladimir.buzalka
**Runs in:** Docker container `python-runner` on the Unraid Tower (192.168.1.76), next to MongoDB.

## What it is

A unified **Tower-side ingest** of JNJ e-mails that merges two previously separate halves
into a single run:

| Phase | Previously standalone | What it does |
|---|---|---|
| **1. PARSE** | `parse_emails_tower_v1.3.py` | `.msg` files from `/mnt/JNJEMAILS` → rich document in Mongo `emaily."vbuzalka@its.jnj.com"` (body, attachments, headers, MAPI props). `_id` = Internet Message-ID. |
| **2. SYNC** | `sync_jnj_state_v1.0.py` | newest `/mnt/JNJEMAILS/db/jnjemails_*.db` (SQLite, **read-only**, `mode=ro`) → mirrored into `jnj_messages`, plus `jnj_folder`/state filled into `emaily`. |

**Order: parse RUNS BEFORE sync.** Freshly parsed mails therefore get their folder path within the
same run (previously, when sync ran ahead of parse, a new mail had nothing to match against, because
sync does not create stubs). The join key everywhere is **Internet Message-ID = Mongo `_id`**.
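
To illustrate why the order matters, the sketch below models the sync phase as an update-only join on Message-ID. It is a simplified in-memory stand-in, not the script's code: the function name `apply_sync` and the row keys `message_id` and `folder` are assumptions made for this example.

```python
def apply_sync(parsed_docs: dict[str, dict], state_rows: list[dict]) -> list[str]:
    """Copy folder state onto already-parsed documents; never create stubs.

    parsed_docs maps Internet Message-ID (the Mongo _id) to the parsed document.
    state_rows stands in for rows read from the SQLite state database.
    Returns the Message-IDs that had no parsed document to match.
    """
    unmatched = []
    for row in state_rows:
        doc = parsed_docs.get(row["message_id"])
        if doc is None:
            # Update-only: a row without a parsed mail is skipped, not inserted.
            unmatched.append(row["message_id"])
            continue
        doc["jnj_folder"] = row["folder"]
    return unmatched


rows = [
    {"message_id": "<a@x>", "folder": "Inbox"},
    {"message_id": "<b@x>", "folder": "Inbox/Study"},
]

# Sync first: <b@x> is not parsed yet, so its folder is missed in this run.
docs = {"<a@x>": {}}
missed = apply_sync(docs, rows)

# Parse first: both documents exist, so both receive their folder.
docs_after_parse = {"<a@x>": {}, "<b@x>": {}}
missed_after_parse = apply_sync(docs_after_parse, rows)
```

In the first call one Message-ID goes unmatched; in the second call nothing is missed, which is the situation the parse→sync order is meant to guarantee.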

## Incrementality (suitable for a cron job every 5 min)

- **PARSE** parses only `.msg` files whose `mtime` is newer than the watermark
  (`jnj_sync_state` / `_id="parse_state"` → `last_parse_mtime`).
  - **First run = seed:** the watermark is missing → candidates = files whose `filename`
    is not yet in Mongo (a one-off `distinct("filename")`); afterwards the watermark
    is set to the newest file.
  - **Subsequent runs = incremental:** only `mtime > watermark`. No Mongo scan.
  - `--full` reparses everything (upsert, idempotent).
  - **Indexes** are created only on `full`/`seed`/`--reindex` (in incremental mode they already exist).
- **SYNC** uses the `updated_at` watermark (`jnj_sync_state` / `_id="watermark"`) plus the
  `last_db` shortcut (same SQLite file as last time → immediate no-op, Mongo data is not touched).

Two independent events (a new `.msg` / a new `.db`) → the script runs only the phase that has
work to do; otherwise the run is a cheap no-op.
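
The seed/incremental selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's implementation: `select_candidates` is a hypothetical name, only the top level of the directory is scanned, and the set of filenames already in Mongo is passed in rather than queried.

```python
from __future__ import annotations

from pathlib import Path


def select_candidates(
    root: Path, watermark: float | None, known_filenames: set[str]
) -> tuple[list[Path], float | None]:
    """Return (.msg files to parse, watermark to store after a successful run)."""
    mtimes = {p: p.stat().st_mtime for p in root.glob("*.msg")}
    if watermark is None:
        # Seed run: no watermark yet, so fall back to "filename not in Mongo".
        todo = [p for p in mtimes if p.name not in known_filenames]
    else:
        # Incremental run: a pure filesystem check, no database scan.
        todo = [p for p, mtime in mtimes.items() if mtime > watermark]
    # The watermark only ever moves forward, to the newest file seen so far.
    seen = list(mtimes.values())
    if watermark is not None:
        seen.append(watermark)
    return sorted(todo), max(seen, default=None)
```

When nothing is newer than the stored watermark the candidate list is empty, which is what makes a run every five minutes cheap.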

## Arguments

| Argument | Meaning |
|---|---|
| `--dry-run` | writes nothing, only prints the plan for both phases |
| `--full` | parse: reparse everything; sync: ignore the watermark |
| `--limit N` | at most N files (parse) / rows (sync), for testing |
| `--reindex` | forces index creation after the parse phase |
| `--force` | sync: ignore the `last_db` shortcut |
| `--parse-only` | PARSE phase only |
| `--sync-only` | SYNC phase only |
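
For orientation, the table could map onto an `argparse` definition along these lines. The flag names come from the table; everything else is an assumption for illustration, in particular the default of `0` for `--limit` (meaning "no limit") and treating `--parse-only` and `--sync-only` as mutually exclusive.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="jnj_tower_ingest")
    parser.add_argument("--dry-run", action="store_true", help="write nothing, only print the plan")
    parser.add_argument("--full", action="store_true", help="reparse everything / ignore the sync watermark")
    # Assumption: 0 means "no limit".
    parser.add_argument("--limit", type=int, default=0, metavar="N", help="max N files (parse) / rows (sync)")
    parser.add_argument("--reindex", action="store_true", help="force index creation after parse")
    parser.add_argument("--force", action="store_true", help="sync: ignore the last_db shortcut")
    # Assumption: the two phase selectors are treated as mutually exclusive here.
    phase = parser.add_mutually_exclusive_group()
    phase.add_argument("--parse-only", action="store_true", help="run only the PARSE phase")
    phase.add_argument("--sync-only", action="store_true", help="run only the SYNC phase")
    return parser


args = build_parser().parse_args(["--full", "--limit", "50", "--parse-only"])
run_parse = not args.sync_only   # PARSE runs unless --sync-only was given
run_sync = not args.parse_only   # SYNC runs unless --parse-only was given
```

With no phase selector both `run_parse` and `run_sync` are true, which matches the default behaviour of running parse and then sync.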

## Running

```bash
# Test:
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.0.py --dry-run
# Live incremental run (this is what cron calls):
docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.0.py
# Full reparse + reindex:
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.0.py --full --reindex
```

## Scheduling (DONE)

The Unraid User Scripts job `jnj_state_sync` (cron `*/5 * * * *`) is a wrapper that takes a `flock`
and calls `docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.0.py`.
It logs only real work and errors to `/mnt/user/Scripts/logs/jnj_tower_ingest.log`
(grep `Zapisuji|PARSE hotovo|SYNC hotovo|CHYBA|Traceback`). The cron line/schedule did not change
when switching over from `sync_jnj_state`; only the wrapper's contents did.
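
The `flock` guard keeps a slow run from overlapping with the next cron tick. The actual wrapper is a shell script that is not shown here; purely as an illustration of the same idea in Python, a non-blocking exclusive lock can be taken with the standard `fcntl` module (Unix only). The function name `try_lock` and the lock file location are hypothetical.

```python
import fcntl
import os
import tempfile


def try_lock(path: str):
    """Return an open file holding an exclusive lock, or None if another run holds it."""
    handle = open(path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting for the lock.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Hypothetical lock file in a fresh temporary directory, used only for this demo.
lock_path = os.path.join(tempfile.mkdtemp(), "jnj_tower_ingest_demo.lock")
first = try_lock(lock_path)    # this "run" gets the lock
second = try_lock(lock_path)   # an overlapping "run" is turned away
first.close()                  # closing the file releases the lock
third = try_lock(lock_path)    # the next tick can proceed again
third.close()
```

An overlapping invocation simply exits when it gets `None`, so a run that takes longer than five minutes delays the next one instead of racing it.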

## Revert

The old scripts `parse_emails_tower_v1.3.py` and `sync_jnj_state_v1.0.py` remain in
`/scripts/` as a safety net. To roll back, rewrite the wrapper to call `sync_jnj_state_v1.0.py` again.

## Dependencies

`extract-msg==0.55.0`, `olefile`, `pymongo`, `python-dateutil`, `sqlite3` (stdlib).
Python 3.10+.

## Version history

- **1.0.0** 2026-06-10: merged `parse_emails_tower_v1.3` + `sync_jnj_state_v1.0`;
  parse made incremental via an mtime watermark; indexes only on full/seed/`--reindex`;
  order parse→sync.
@@ -0,0 +1,514 @@
"""
==============================================================================
Script:  1b_parse_emails_graph_delta_v1.0.py
Version: 1.0
Date:    2026-06-04
Author:  vladimir.buzalka

Description:
    Incremental e-mail sync via Microsoft Graph DELTA QUERY.
    Sibling of `1_parse_emails_graph_v1.4.py`; each covers a different use case:

        1_parse_emails_graph_v1.4.py        = initial full import of a mailbox
        1b_parse_emails_graph_delta_v1.0.py = regular sync (changes since last run)

    Delta query is server-side change tracking: Graph remembers a "bookmark"
    (deltaLink) and returns only what has changed since it:
      - new messages
      - changes to existing ones (isRead, flag, move to another folder, categories)
      - DELETED messages (@removed): permanently deleted, not merely in the trash

    For a mail in "Deleted Items" delta does nothing special; it is still a
    normal message, just with folder_path="Deleted Items". @removed arrives only
    once the user empties the trash / uses Shift+Del.

State:
    Collection `emaily.sync_state`, _id = "<mailbox>|<folder_id>".
    {
        mailbox, folder_id, folder_path,
        delta_link,        # full URL with $deltatoken for the next run
        last_run_at,
        cumulative_new, cumulative_sync, cumulative_removed
    }

Permanently deleted messages:
    The script does NOT DELETE them from Mongo. It only sets:
        permanently_deleted:    True
        permanently_deleted_at: <UTC datetime of detection>
    Lookup: col.find({"permanently_deleted": True})

Reuse:
    The functions extract_message / extract_sync_fields are loaded directly from
    the module 1_parse_emails_graph_v1.4.py (importlib, file-based), so that the
    extraction logic never diverges.

Usage:
    python 1b_parse_emails_graph_delta_v1.0.py                                  # ALL mailboxes (except SKIP_MAILBOXES)
    python 1b_parse_emails_graph_delta_v1.0.py --mailbox ordinace@buzalkova.cz  # a single mailbox
    python 1b_parse_emails_graph_delta_v1.0.py --mailbox ordinace@buzalkova.cz --folder Inbox
    python 1b_parse_emails_graph_delta_v1.0.py --reset                          # discard deltaLinks and start over
    python 1b_parse_emails_graph_delta_v1.0.py --dry-run                        # store nothing

SKIP_MAILBOXES (hardcoded):
    vbuzalka@its.jnj.com: JNJ tenant, we have no Graph API access. This
    mailbox needs a separate script (local .msg files).

Dependencies:
    msal, requests, pymongo, python-dateutil
    Python 3.10+
==============================================================================
"""

from __future__ import annotations

import argparse
import importlib.util
import logging
import sys
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

import msal
import requests
from pymongo import MongoClient, ASCENDING

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# ─── CONFIGURATION ────────────────────────────────────────────────────────────
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
GRAPH_CLIENT_SECRET = "Txg8Q~MjhocuopxsJyJBhPmDfMxZ2r5WpTFj1dfk"
GRAPH_URL = "https://graph.microsoft.com/v1.0"

MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
SYNC_STATE_COL = "sync_state"
PAGE_SIZE = 100  # the delta endpoint typically returns at most 100 items per page
LOG_FILE = Path(__file__).parent / "delta_errors.log"
SCRIPT_VERSION = "1.0"

# Collections in `emaily` that are NOT mailboxes:
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}

# Mailboxes for which we have NO Graph API access; they are skipped in a regular run.
# These need a separate script (e.g. a local .msg parser).
SKIP_MAILBOXES = {
    "vbuzalka@its.jnj.com",  # JNJ tenant, no Graph credentials
}

logging.basicConfig(
    filename=str(LOG_FILE),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)

# Fields to pull from the delta endpoint (same as MSG_SELECT in v1.4, except
# internetMessageHeaders, which delta cannot return for every item; for new
# messages they are fetched with a separate request).
DELTA_SELECT = (
    "id,internetMessageId,subject,bodyPreview,body,"
    "importance,isRead,isDraft,hasAttachments,"
    "receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
    "sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
    "conversationId,conversationIndex,parentFolderId,"
    "categories,flag,inferenceClassification"
)

# To load a new message in full (including headers + attachments) we use the
# same select+expand as v1.4
FULL_FETCH_SELECT = (
    "id,internetMessageId,subject,bodyPreview,body,"
    "importance,isRead,isDraft,hasAttachments,"
    "receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
    "sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
    "conversationId,conversationIndex,parentFolderId,"
    "categories,flag,inferenceClassification,internetMessageHeaders"
)
FULL_FETCH_EXPAND = "attachments($select=id,name,contentType,size,isInline)"

# ─── Reuse of the extract logic from v1.4 ─────────────────────────────────────

_HERE = Path(__file__).parent
_V14_PATH = _HERE / "1_parse_emails_graph_v1.4.py"
if not _V14_PATH.exists():
    print(f"ERROR: sibling {_V14_PATH.name} is missing, cannot load the extract logic", file=sys.stderr)
    sys.exit(1)

# The file name starts with a digit, so it is loaded by path rather than imported by name.
_spec = importlib.util.spec_from_file_location("v14_parse", _V14_PATH)
_v14 = importlib.util.module_from_spec(_spec)
_spec.loader.exec_module(_v14)
extract_message = _v14.extract_message
extract_sync_fields = _v14.extract_sync_fields

# GRAPH_MAILBOX is module-level in v1.4; extract does not need it, but for
# consistency we set it in main()

# ─── Graph API ────────────────────────────────────────────────────────────────

_graph_token: Optional[str] = None


def get_token() -> str:
    """Acquire an app-only Graph token (client credentials) and cache it module-wide."""
    global _graph_token
    app = msal.ConfidentialClientApplication(
        GRAPH_CLIENT_ID,
        authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
        client_credential=GRAPH_CLIENT_SECRET,
    )
    result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
    if "access_token" not in result:
        raise RuntimeError(f"Graph auth failed: {result}")
    _graph_token = result["access_token"]
    return _graph_token


class DeltaExpired(Exception):
    """The deltaLink has expired (HTTP 410); the delta must be restarted from scratch."""


def graph_get(url: str, params: Optional[dict] = None, allow_410: bool = False) -> dict:
    """GET against Graph, retrying on 401 and 429. On 410 with allow_410=True raises DeltaExpired."""
    global _graph_token
    if not _graph_token:
        get_token()
    for attempt in range(3):
        r = requests.get(
            url,
            headers={"Authorization": f"Bearer {_graph_token}"},
            params=params,
            timeout=60,
        )
        if r.status_code == 401:
            # token expired or rejected: refresh it and retry
            get_token()
            continue
        if r.status_code == 410 and allow_410:
            raise DeltaExpired(url)
        if r.status_code == 429:
            # rate limit: respect Retry-After
            wait = int(r.headers.get("Retry-After", "5"))
            print(f" [429] waiting {wait}s ...")
            time.sleep(wait)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retries: {url}")


def get_all_folders(mailbox: str, parent_id: Optional[str] = None, parent_path: str = "") -> list[dict]:
    """Recursively list every mail folder of a mailbox as {"id", "path"} dicts."""
    if parent_id is None:
        url = f"{GRAPH_URL}/users/{mailbox}/mailFolders"
    else:
        url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{parent_id}/childFolders"

    folders = []
    params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
    while url:
        data = graph_get(url, params)
        for f in data.get("value", []):
            path = f"{parent_path}/{f['displayName']}".lstrip("/")
            folders.append({"id": f["id"], "path": path})
            if f.get("childFolderCount", 0) > 0:
                folders.extend(get_all_folders(mailbox, f["id"], path))
        # nextLink already carries the query parameters
        url = data.get("@odata.nextLink")
        params = None
    return folders


def fetch_full_message(mailbox: str, msg_id: str) -> Optional[dict]:
    """Download the whole message including headers and attachments; used for new messages picked up by the delta."""
    url = f"{GRAPH_URL}/users/{mailbox}/messages/{msg_id}"
    params = {"$select": FULL_FETCH_SELECT, "$expand": FULL_FETCH_EXPAND}
    try:
        return graph_get(url, params)
    except requests.HTTPError as e:
        logging.error("fetch_full_message %s: %s", msg_id, e)
        return None


# ─── Delta iteration ──────────────────────────────────────────────────────────

def iter_folder_delta(mailbox: str, folder_id: str, delta_link: Optional[str], limit: int = 0):
    """
    Generator yielding (item, final_delta_link) tuples.
    item is a dict holding one entry (either a change or {'@removed': ...}).
    The last tuple yielded has final_delta_link != None (and item None).

    On HTTP 410 (expired deltaLink) raises DeltaExpired; the caller should
    run again with delta_link=None (= fresh full delta).
    """
    if delta_link:
        url = delta_link
        params = None
    else:
        url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{folder_id}/messages/delta"
        params = {"$select": DELTA_SELECT, "$top": PAGE_SIZE}

    n = 0
    while url:
        data = graph_get(url, params, allow_410=True)
        params = None
        for item in data.get("value", []):
            yield item, None
            n += 1
            if limit and n >= limit:
                # With --limit we stop without yielding a deltaLink, so no state is
                # saved; this is a test mode, so the next run simply starts over.
                return
        next_link = data.get("@odata.nextLink")
        final_link = data.get("@odata.deltaLink")
        if final_link:
            # done: hand over the final deltaLink
            yield None, final_link
            return
        url = next_link


# ─── Per-folder sync ──────────────────────────────────────────────────────────
|
||||
|
||||
def sync_folder(col, sync_col, mailbox: str, folder: dict, dry_run: bool, limit: int) -> dict:
|
||||
"""Vrati statistiky."""
|
||||
fid = folder["id"]
|
||||
fpath = folder["path"]
|
||||
state_id = f"{mailbox}|{fid}"
|
||||
state = sync_col.find_one({"_id": state_id})
|
||||
delta_link = state.get("delta_link") if state else None
|
||||
|
||||
is_first_run = delta_link is None
|
||||
label = "FRESH" if is_first_run else "DELTA"
|
||||
print(f"\n[{label}] {fpath}")
|
||||
|
||||
stats = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
|
||||
final_delta = None
|
||||
|
||||
try:
|
||||
gen = iter_folder_delta(mailbox, fid, delta_link, limit=limit)
|
||||
for item, fin in gen:
|
||||
if fin:
|
||||
final_delta = fin
|
||||
break
|
||||
try:
|
||||
process_item(col, mailbox, fpath, item, stats, dry_run)
|
||||
except Exception as e:
|
||||
stats["errors"] += 1
|
||||
logging.error("process_item %s: %s", item.get("id", "?"), e)
|
||||
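    except DeltaExpired:
        print(" [410] deltaLink expired")
        if dry_run:
            # A dry run must not modify Mongo, and without clearing the stored
            # state a restart would hit the same expired link, so stop here.
            print(" [dry-run] state not cleared, folder skipped")
            return stats
        # recursive restart from a fresh delta with the stored state cleared
        sync_col.delete_one({"_id": state_id})
        return sync_folder(col, sync_col, mailbox, folder, dry_run, limit)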
|
||||
|
||||
print(f" new={stats['new']} sync={stats['sync']} removed={stats['removed']} err={stats['errors']}")
|
||||
|
||||
# Ulozit sync_state pokud mame final_delta a neni dry run
|
||||
if final_delta and not dry_run:
|
||||
sync_col.update_one(
|
||||
{"_id": state_id},
|
||||
{
|
||||
"$set": {
|
||||
"mailbox": mailbox,
|
||||
"folder_id": fid,
|
||||
"folder_path": fpath,
|
||||
"delta_link": final_delta,
|
||||
"last_run_at": datetime.now(timezone.utc).replace(tzinfo=None),
|
||||
},
|
||||
"$inc": {
|
||||
"cumulative_new": stats["new"],
|
||||
"cumulative_sync": stats["sync"],
|
||||
"cumulative_removed": stats["removed"],
|
||||
"run_count": 1,
|
||||
},
|
||||
},
|
||||
upsert=True,
|
||||
)
|
||||
elif not final_delta:
|
||||
# neprisel deltaLink (napr. limit nebo chyba) — nemenime state, pristi beh
|
||||
# bude pokracovat normalne podle stareho deltaLinku nebo zacne od fresh
|
||||
if not is_first_run:
|
||||
print(f" [pozn] delta neukoncena — pristi beh pojede od ulozeneho deltaLinku")
|
||||
|
||||
return stats
|
||||
|
||||
|
||||
def process_item(col, mailbox: str, folder_path: str, item: dict, stats: dict, dry_run: bool):
|
||||
"""Zpracuje jednu polozku z delta odpovedi."""
|
||||
# 1) Smazana zprava (@removed)
|
||||
if "@removed" in item or item.get("@removed.reason"):
|
||||
graph_id = item.get("id")
|
||||
if not graph_id:
|
||||
return
|
||||
if dry_run:
|
||||
print(f" REMOVED graph_id={graph_id[:30]}...")
|
||||
else:
|
||||
col.update_one(
|
||||
{"graph_id": graph_id},
|
||||
{"$set": {
|
||||
"permanently_deleted": True,
|
||||
"permanently_deleted_at": datetime.now(timezone.utc).replace(tzinfo=None),
|
||||
}},
|
||||
)
|
||||
stats["removed"] += 1
|
||||
return
|
||||
|
||||
# 2) Nova nebo zmenena zprava — rozhodneme podle existence graph_id v Mongo
|
||||
graph_id = item.get("id")
|
||||
if not graph_id:
|
||||
return
|
||||
|
||||
existing = col.find_one({"graph_id": graph_id}, {"_id": 1})
|
||||
|
||||
if existing:
|
||||
# Existujici zprava — update jen sync poli (delta payload je obsahuje)
|
||||
fields = extract_sync_fields(item, folder_path)
|
||||
if dry_run:
|
||||
print(f" SYNC {item.get('subject','')[:60]}")
|
||||
else:
|
||||
col.update_one({"_id": existing["_id"]}, {"$set": fields})
|
||||
stats["sync"] += 1
|
||||
else:
|
||||
# Nova zprava — pro telo+attachments+headers fetchneme plnou verzi
|
||||
full = fetch_full_message(mailbox, graph_id)
|
||||
if full is None:
|
||||
stats["errors"] += 1
|
||||
return
|
||||
doc = extract_message(full, folder_path)
|
||||
if doc is None:
|
||||
stats["errors"] += 1
|
||||
return
|
||||
if dry_run:
|
||||
print(f" NEW {doc.get('subject','')[:60]}")
|
||||
else:
|
||||
col.update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
|
||||
stats["new"] += 1
|
||||
|
||||
|
||||
# ─── Indexy pro sync_state ────────────────────────────────────────────────────
|
||||
|
||||
def ensure_sync_state_indexes(sync_col):
|
||||
sync_col.create_index([("mailbox", ASCENDING), ("folder_id", ASCENDING)])
|
||||
sync_col.create_index([("last_run_at", ASCENDING)])
|
||||
|
||||
|
||||
def ensure_perm_deleted_index(col):
|
||||
col.create_index([("permanently_deleted", ASCENDING)], sparse=True)
|
||||
|
||||
|
||||
# ─── Main ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
def discover_mailboxes(db) -> list[str]:
|
||||
"""Vrati seznam mailboxu = vsechny kolekce v `emaily` mimo NON_MAILBOX_COLLECTIONS
|
||||
a SKIP_MAILBOXES."""
|
||||
out = []
|
||||
for name in sorted(db.list_collection_names()):
|
||||
if name in NON_MAILBOX_COLLECTIONS:
|
||||
continue
|
||||
if name in SKIP_MAILBOXES:
|
||||
print(f" [skip] {name} — v SKIP_MAILBOXES (neni Graph pristup)")
|
||||
continue
|
||||
out.append(name)
|
||||
return out
|
||||
|
||||
|
||||
def sync_mailbox(client, mailbox: str, args) -> dict:
|
||||
"""Sync jedne schranky. Vraci totals dict."""
|
||||
_v14.GRAPH_MAILBOX = mailbox
|
||||
|
||||
print(f"\n========== {mailbox} ==========")
|
||||
|
||||
col = client[MONGO_DB][mailbox]
|
||||
sync_col = client[MONGO_DB][SYNC_STATE_COL]
|
||||
|
||||
if not args.dry_run:
|
||||
ensure_sync_state_indexes(sync_col)
|
||||
ensure_perm_deleted_index(col)
|
||||
|
||||
if args.reset:
|
||||
n = sync_col.delete_many({"mailbox": mailbox}).deleted_count
|
||||
print(f" --reset: smazano {n} deltaLinku pro {mailbox}")
|
||||
|
||||
print("Nacitam seznam slozek...")
|
||||
try:
|
||||
folders = get_all_folders(mailbox)
|
||||
except requests.HTTPError as e:
|
||||
print(f" CHYBA: nelze nacist slozky pro {mailbox}: {e}")
|
||||
logging.error("get_all_folders %s: %s", mailbox, e)
|
||||
return {"new": 0, "sync": 0, "removed": 0, "errors": 1}
|
||||
|
||||
if args.folder:
|
||||
folders = [f for f in folders if args.folder.lower() in f["path"].lower()]
|
||||
print(f" Slozek ke zpracovani: {len(folders)}")
|
||||
|
||||
totals = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
|
||||
for folder in folders:
|
||||
s = sync_folder(col, sync_col, mailbox, folder, args.dry_run, args.limit)
|
||||
for k in totals:
|
||||
totals[k] += s[k]
|
||||
print(f" -> mailbox total: new={totals['new']} sync={totals['sync']} removed={totals['removed']} err={totals['errors']}")
|
||||
return totals
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser(description=f"parse_emails_graph delta sync v{SCRIPT_VERSION}")
|
||||
ap.add_argument("--mailbox", default="",
|
||||
help="E-mail schranky (= kolekce v Mongo). "
|
||||
"Bez argumentu projede vsechny schranky z `emaily` (mimo SKIP_MAILBOXES).")
|
||||
ap.add_argument("--folder", default="", help="Filtruje slozky obsahujici tento retezec (default: vsechny)")
|
||||
ap.add_argument("--limit", type=int, default=0, help="Max polozek na slozku (test)")
|
||||
ap.add_argument("--reset", action="store_true",
|
||||
help="Smaze deltaLinky pro vybrane schranky — pristi beh zacne od fresh delta")
|
||||
ap.add_argument("--dry-run", action="store_true", help="Nic neulozi do Mongo, jen vypise co by se stalo")
|
||||
args = ap.parse_args()
|
||||
|
||||
print(f"=== Delta sync v{SCRIPT_VERSION} ===")
|
||||
if args.dry_run:
|
||||
print(" DRY-RUN — zadne zmeny v Mongo")
|
||||
|
||||
print("Pripojuji se k MongoDB...")
|
||||
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
|
||||
client.admin.command("ping")
|
||||
db = client[MONGO_DB]
|
||||
|
||||
if args.mailbox:
|
||||
if args.mailbox in SKIP_MAILBOXES:
|
||||
print(f" CHYBA: {args.mailbox} je v SKIP_MAILBOXES — neni Graph pristup.")
|
||||
sys.exit(2)
|
||||
mailboxes = [args.mailbox]
|
||||
else:
|
||||
mailboxes = discover_mailboxes(db)
|
||||
print(f" Schranky ke zpracovani: {len(mailboxes)}")
|
||||
for m in mailboxes:
|
||||
print(f" {m}")
|
||||
|
||||
print("Token Graph API...")
|
||||
get_token()
|
||||
print(" OK")
|
||||
|
||||
t0 = time.time()
|
||||
grand = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
|
||||
per_mailbox = []
|
||||
for mb in mailboxes:
|
||||
try:
|
||||
s = sync_mailbox(client, mb, args)
|
||||
except Exception as e:
|
||||
print(f" FATAL pri sync {mb}: {e}")
|
||||
logging.error("sync_mailbox %s: %s", mb, e)
|
||||
s = {"new": 0, "sync": 0, "removed": 0, "errors": 1}
|
||||
per_mailbox.append((mb, s))
|
||||
for k in grand:
|
||||
grand[k] += s[k]
|
||||
|
||||
dt = time.time() - t0
|
||||
print(f"\n=== SHRNUTI ===")
|
||||
for mb, s in per_mailbox:
|
||||
print(f" {mb:40} new={s['new']:>5} sync={s['sync']:>5} removed={s['removed']:>4} err={s['errors']:>3}")
|
||||
print(f" {'TOTAL':40} new={grand['new']:>5} sync={grand['sync']:>5} removed={grand['removed']:>4} err={grand['errors']:>3}")
|
||||
print(f" trvalo: {dt:.1f} s")
|
||||
return 1 if grand["errors"] > 0 else 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main() or 0)
|
||||
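The paging loop in `iter_folder_delta` above can be illustrated in isolation. The following is only a minimal sketch under stated assumptions: `FAKE_PAGES`, `fake_get` and `walk_delta` are hypothetical names introduced here, and the canned dictionaries stand in for real Graph responses. It shows the pattern of following `@odata.nextLink` until a page carries `@odata.deltaLink`, which is the bookmark to store for the next run.

```python
from typing import Callable, Optional, Tuple

# Hypothetical canned responses standing in for Graph delta pages.
FAKE_PAGES = {
    "page1": {"value": [{"id": "a"}, {"id": "b"}], "@odata.nextLink": "page2"},
    "page2": {
        "value": [{"id": "c", "@removed": {"reason": "deleted"}}],
        "@odata.deltaLink": "delta-token-1",
    },
}


def fake_get(url: str) -> dict:
    """Stand-in for an HTTP GET; returns the canned page for `url`."""
    return FAKE_PAGES[url]


def walk_delta(start_url: str, get: Callable[[str], dict]) -> Tuple[list, Optional[str]]:
    """Collect all items and return them with the final deltaLink (or None)."""
    items = []
    url: Optional[str] = start_url
    delta_link = None
    while url:
        page = get(url)
        items.extend(page.get("value", []))
        delta_link = page.get("@odata.deltaLink")
        if delta_link:
            break  # last page: this link is the bookmark for the next run
        url = page.get("@odata.nextLink")
    return items, delta_link


items, link = walk_delta("page1", fake_get)
print([i["id"] for i in items], link)
```

Returning the bookmark only after the last page mirrors the script's choice to save state only when a full pass has completed.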
@@ -0,0 +1,523 @@
|
"""
==============================================================================
Script:  1b_parse_emails_graph_delta_v1.1.py
Version: 1.1
Date:    2026-06-10
Author:  vladimir.buzalka

Changes in v1.1 (2026-06-10):
  - Bugfix: NON_MAILBOX_COLLECTIONS now also contains "jnj_messages" and
    "jnj_sync_state" (helper collections for JNJ folder tracking). Previously
    discover_mailboxes treated them as mailboxes -> Graph 404 on
    /users/jnj_messages/mailFolders -> the whole step 1b ended with FAIL(1)
    on every run.

Description:
  Incremental e-mail sync via Microsoft Graph DELTA QUERY.
  Sibling of `1_parse_emails_graph_v1.4.py`; each covers a different use case:

    1_parse_emails_graph_v1.4.py         = first full import of a mailbox
    1b_parse_emails_graph_delta_v1.1.py  = regular sync (changes since last run)

  Delta query is server-side change tracking: Graph remembers a "bookmark"
  (deltaLink) and returns only what has changed since then:
    - new messages
    - changes to existing ones (isRead, flag, move to another folder, categories)
    - DELETED messages (@removed): permanently deleted, not merely in the trash

  For mail in "Deleted Items" delta does nothing special; it is still a
  normal message, only with folder_path="Deleted Items". @removed arrives
  only once the user empties the trash or uses Shift+Del.

State:
  Collection `emaily.sync_state`, _id = "<mailbox>|<folder_id>".
  {
    mailbox, folder_id, folder_path,
    delta_link,       # full URL with $deltatoken for the next run
    last_run_at,
    cumulative_new, cumulative_sync, cumulative_removed, run_count
  }

Permanently deleted messages:
  The script does NOT delete them from Mongo. It only sets:
    permanently_deleted:    True
    permanently_deleted_at: <UTC datetime of detection>
  Lookup: col.find({"permanently_deleted": True})

Reuse:
  The functions extract_message / extract_sync_fields are loaded directly from
  the module 1_parse_emails_graph_v1.4.py (importlib, file-based) so that the
  extraction logic never diverges.

Usage:
  python 1b_parse_emails_graph_delta_v1.1.py                                   # ALL mailboxes (except SKIP_MAILBOXES)
  python 1b_parse_emails_graph_delta_v1.1.py --mailbox ordinace@buzalkova.cz   # a single mailbox
  python 1b_parse_emails_graph_delta_v1.1.py --mailbox ordinace@buzalkova.cz --folder Inbox
  python 1b_parse_emails_graph_delta_v1.1.py --reset                           # drop deltaLinks and start over
  python 1b_parse_emails_graph_delta_v1.1.py --dry-run                         # store nothing

SKIP_MAILBOXES (hardcoded):
  vbuzalka@its.jnj.com: JNJ tenant, we have no Graph API access. This
  mailbox needs a separate script (local .msg files).

Dependencies:
  msal, requests, pymongo, python-dateutil
  Python 3.10+
==============================================================================
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import importlib.util
|
||||
import logging
|
||||
import sys
|
||||
import time
|
||||
from datetime import datetime, timezone
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
import msal
|
||||
import requests
|
||||
from pymongo import MongoClient, ASCENDING
|
||||
|
||||
if hasattr(sys.stdout, "reconfigure"):
|
||||
sys.stdout.reconfigure(encoding="utf-8", errors="replace")
|
||||
|
||||
# ─── KONFIGURACE ──────────────────────────────────────────────────────────────
|
||||
GRAPH_TENANT_ID = "7d269944-37a4-43a1-8140-c7517dc426e9"
|
||||
GRAPH_CLIENT_ID = "4b222bfd-78c9-4239-a53f-43006b3ed07f"
|
||||
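import os  # needed only for the environment lookup below

# Secret removed from the source. Assumption: it is supplied through the
# GRAPH_CLIENT_SECRET environment variable (name chosen here, not in the original).
GRAPH_CLIENT_SECRET = os.environ["GRAPH_CLIENT_SECRET"]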
|
||||
GRAPH_URL = "https://graph.microsoft.com/v1.0"
|
||||
|
||||
MONGO_URI = "mongodb://192.168.1.76:27017"
|
||||
MONGO_DB = "emaily"
|
||||
SYNC_STATE_COL = "sync_state"
|
||||
PAGE_SIZE = 100  # the delta endpoint typically returns at most 100 items per page
|
||||
LOG_FILE = Path(__file__).parent / "delta_errors.log"
|
||||
SCRIPT_VERSION = "1.1"
|
||||
|
||||
# Collections in `emaily` that are NOT mailboxes
# (jnj_messages + jnj_sync_state are helper collections for JNJ folder tracking;
# unless excluded, discover_mailboxes treats them as mailboxes -> Graph 404 -> FAIL).
|
||||
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state",
|
||||
"jnj_messages", "jnj_sync_state"}
|
||||
|
||||
# Mailboxes for which we have NO Graph API access; a normal run skips them.
# They need a separate script (e.g. a local .msg parser).
SKIP_MAILBOXES = {
    "vbuzalka@its.jnj.com",  # JNJ tenant, no Graph credentials
}
|
||||
|
||||
logging.basicConfig(
|
||||
filename=str(LOG_FILE),
|
||||
level=logging.ERROR,
|
||||
format="%(asctime)s | %(message)s",
|
||||
datefmt="%Y-%m-%d %H:%M:%S",
|
||||
encoding="utf-8",
|
||||
)
|
||||
|
||||
# Fields to pull from the delta endpoint (same as MSG_SELECT in v1.4, except
# internetMessageHeaders, which delta cannot return for every item; for new
# messages they are fetched with a separate request).
|
||||
DELTA_SELECT = (
|
||||
"id,internetMessageId,subject,bodyPreview,body,"
|
||||
"importance,isRead,isDraft,hasAttachments,"
|
||||
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
|
||||
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
|
||||
"conversationId,conversationIndex,parentFolderId,"
|
||||
"categories,flag,inferenceClassification"
|
||||
)
|
||||
|
||||
# To load a new message in full (including headers and attachments) we use the
# same select+expand as v1.4.
|
||||
FULL_FETCH_SELECT = (
|
||||
"id,internetMessageId,subject,bodyPreview,body,"
|
||||
"importance,isRead,isDraft,hasAttachments,"
|
||||
"receivedDateTime,sentDateTime,createdDateTime,lastModifiedDateTime,"
|
||||
"sender,from,toRecipients,ccRecipients,bccRecipients,replyTo,"
|
||||
"conversationId,conversationIndex,parentFolderId,"
|
||||
"categories,flag,inferenceClassification,internetMessageHeaders"
|
||||
)
|
||||
FULL_FETCH_EXPAND = "attachments($select=id,name,contentType,size,isInline)"
|
||||
|
||||
# ─── Reuse extract logiky z v1.4 ──────────────────────────────────────────────
|
||||
|
||||
_HERE = Path(__file__).parent
|
||||
_V14_PATH = _HERE / "1_parse_emails_graph_v1.4.py"
|
||||
if not _V14_PATH.exists():
|
||||
print(f"CHYBA: chybi sourozenec {_V14_PATH.name} — extract logiku nelze nacist", file=sys.stderr)
|
||||
sys.exit(1)
|
||||
|
||||
_spec = importlib.util.spec_from_file_location("v14_parse", _V14_PATH)
|
||||
_v14 = importlib.util.module_from_spec(_spec)
|
||||
_spec.loader.exec_module(_v14)
|
||||
extract_message = _v14.extract_message
|
||||
extract_sync_fields = _v14.extract_sync_fields
|
||||
|
||||
# GRAPH_MAILBOX is module-level in v1.4. The extract functions do not need it,
# but for consistency it is set in sync_mailbox().
|
||||
|
||||
# ─── Graph API ────────────────────────────────────────────────────────────────
|
||||
|
||||
_graph_token: Optional[str] = None
|
||||
|
||||
|
||||
def get_token() -> str:
|
||||
global _graph_token
|
||||
app = msal.ConfidentialClientApplication(
|
||||
GRAPH_CLIENT_ID,
|
||||
authority=f"https://login.microsoftonline.com/{GRAPH_TENANT_ID}",
|
||||
client_credential=GRAPH_CLIENT_SECRET,
|
||||
)
|
||||
result = app.acquire_token_for_client(scopes=["https://graph.microsoft.com/.default"])
|
||||
if "access_token" not in result:
|
||||
raise RuntimeError(f"Graph auth failed: {result}")
|
||||
_graph_token = result["access_token"]
|
||||
return _graph_token
|
||||
|
||||
|
||||
class DeltaExpired(Exception):
    """The deltaLink has expired (HTTP 410); a full delta must be started again."""


def graph_get(url: str, params: Optional[dict] = None, allow_410: bool = False) -> dict:
    """GET against Graph. Refreshes the token on 401 and waits on 429.

    With allow_410=True, an HTTP 410 raises DeltaExpired.
    """
    global _graph_token
    if not _graph_token:
        get_token()
    for attempt in range(3):
        r = requests.get(
            url,
            headers={"Authorization": f"Bearer {_graph_token}"},
            params=params,
            timeout=60,
        )
        if r.status_code == 401:
            # token expired or rejected: fetch a new one and retry
            get_token()
            continue
        if r.status_code == 410 and allow_410:
            raise DeltaExpired(url)
        if r.status_code == 429:
            # rate limit: honor Retry-After; fall back to 5 s when the header
            # is missing or is not a plain number of seconds
            try:
                wait = int(r.headers.get("Retry-After", "5"))
            except ValueError:
                wait = 5
            print(f" [429] waiting {wait}s ...")
            time.sleep(wait)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError(f"Graph GET failed after retries: {url}")
|
||||
|
||||
|
||||
def get_all_folders(mailbox: str, parent_id: str = None, parent_path: str = "") -> list[dict]:
|
||||
if parent_id is None:
|
||||
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders"
|
||||
else:
|
||||
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{parent_id}/childFolders"
|
||||
|
||||
folders = []
|
||||
params = {"$top": 100, "$select": "id,displayName,childFolderCount"}
|
||||
while url:
|
||||
data = graph_get(url, params)
|
||||
for f in data.get("value", []):
|
||||
path = f"{parent_path}/{f['displayName']}".lstrip("/")
|
||||
folders.append({"id": f["id"], "path": path})
|
||||
if f.get("childFolderCount", 0) > 0:
|
||||
folders.extend(get_all_folders(mailbox, f["id"], path))
|
||||
url = data.get("@odata.nextLink")
|
||||
params = None
|
||||
return folders
|
||||
|
||||
|
||||
def fetch_full_message(mailbox: str, msg_id: str) -> Optional[dict]:
|
||||
"""Stahne celou zpravu vcetne hlavicek a priloh — pro nove zpravy zachycene v delte."""
|
||||
url = f"{GRAPH_URL}/users/{mailbox}/messages/{msg_id}"
|
||||
params = {"$select": FULL_FETCH_SELECT, "$expand": FULL_FETCH_EXPAND}
|
||||
try:
|
||||
return graph_get(url, params)
|
||||
except requests.HTTPError as e:
|
||||
logging.error("fetch_full_message %s: %s", msg_id, e)
|
||||
return None
|
||||
|
||||
|
||||
# ─── Delta iterace ────────────────────────────────────────────────────────────
|
||||
|
||||
def iter_folder_delta(mailbox: str, folder_id: str, delta_link: Optional[str], limit: int = 0):
    """
    Generator yielding (item, final_delta_link) tuples.
    item is a dict for one delta entry (a change, or an entry marked '@removed').
    The last tuple yielded is (None, final_delta_link).
    If `limit` stops the iteration early, no final tuple is yielded.

    On HTTP 410 (expired deltaLink) raises DeltaExpired; the caller should
    run again with delta_link=None (a fresh full delta).
    """
|
||||
if delta_link:
|
||||
url = delta_link
|
||||
params = None
|
||||
else:
|
||||
url = f"{GRAPH_URL}/users/{mailbox}/mailFolders/{folder_id}/messages/delta"
|
||||
params = {"$select": DELTA_SELECT, "$top": PAGE_SIZE}
|
||||
|
||||
n = 0
|
||||
while url:
|
||||
data = graph_get(url, params, allow_410=True)
|
||||
params = None
|
||||
for item in data.get("value", []):
|
||||
yield item, None
|
||||
n += 1
|
||||
if limit and n >= limit:
|
||||
                # Limit reached: stop without yielding a deltaLink, so no state is
                # saved. --limit is for testing; the next run starts from the stored state.
|
||||
return
|
||||
next_link = data.get("@odata.nextLink")
|
||||
final_link = data.get("@odata.deltaLink")
|
||||
if final_link:
|
||||
# konec — predame final delta
|
||||
yield None, final_link
|
||||
return
|
||||
url = next_link
|
||||
|
||||
|
||||
# ─── Per-folder sync ──────────────────────────────────────────────────────────
|
||||
|
||||
def sync_folder(col, sync_col, mailbox: str, folder: dict, dry_run: bool, limit: int) -> dict:
|
||||
"""Vrati statistiky."""
|
||||
fid = folder["id"]
|
||||
fpath = folder["path"]
|
||||
state_id = f"{mailbox}|{fid}"
|
||||
state = sync_col.find_one({"_id": state_id})
|
||||
delta_link = state.get("delta_link") if state else None
|
||||
|
||||
is_first_run = delta_link is None
|
||||
label = "FRESH" if is_first_run else "DELTA"
|
||||
print(f"\n[{label}] {fpath}")
|
||||
|
||||
stats = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
|
||||
final_delta = None
|
||||
|
||||
try:
|
||||
gen = iter_folder_delta(mailbox, fid, delta_link, limit=limit)
|
||||
for item, fin in gen:
|
||||
if fin:
|
||||
final_delta = fin
|
||||
break
|
||||
try:
|
||||
process_item(col, mailbox, fpath, item, stats, dry_run)
|
||||
except Exception as e:
|
||||
stats["errors"] += 1
|
||||
logging.error("process_item %s: %s", item.get("id", "?"), e)
|
||||
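    except DeltaExpired:
        print(" [410] deltaLink expired")
        if dry_run:
            # A dry run must not modify Mongo, and without clearing the stored
            # state a restart would hit the same expired link, so stop here.
            print(" [dry-run] state not cleared, folder skipped")
            return stats
        # recursive restart from a fresh delta with the stored state cleared
        sync_col.delete_one({"_id": state_id})
        return sync_folder(col, sync_col, mailbox, folder, dry_run, limit)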
|
||||
|
||||
print(f" new={stats['new']} sync={stats['sync']} removed={stats['removed']} err={stats['errors']}")
|
||||
|
||||
# Ulozit sync_state pokud mame final_delta a neni dry run
|
||||
if final_delta and not dry_run:
|
||||
sync_col.update_one(
|
||||
{"_id": state_id},
|
||||
{
|
||||
"$set": {
|
||||
"mailbox": mailbox,
|
||||
"folder_id": fid,
|
||||
"folder_path": fpath,
|
||||
"delta_link": final_delta,
|
||||
"last_run_at": datetime.now(timezone.utc).replace(tzinfo=None),
|
||||
},
|
||||
"$inc": {
|
||||
"cumulative_new": stats["new"],
|
||||
"cumulative_sync": stats["sync"],
|
||||
"cumulative_removed": stats["removed"],
|
||||
"run_count": 1,
|
||||
},
|
||||
},
|
||||
upsert=True,
|
||||
)
|
||||
elif not final_delta:
|
||||
# neprisel deltaLink (napr. limit nebo chyba) — nemenime state, pristi beh
|
||||
# bude pokracovat normalne podle stareho deltaLinku nebo zacne od fresh
|
||||
if not is_first_run:
|
||||
print(f" [pozn] delta neukoncena — pristi beh pojede od ulozeneho deltaLinku")
|
||||
|
||||
return stats
|
||||
|
||||
|
||||
def process_item(col, mailbox: str, folder_path: str, item: dict, stats: dict, dry_run: bool):
|
||||
"""Zpracuje jednu polozku z delta odpovedi."""
|
||||
# 1) Smazana zprava (@removed)
|
||||
if "@removed" in item or item.get("@removed.reason"):
|
||||
graph_id = item.get("id")
|
||||
if not graph_id:
|
||||
return
|
||||
if dry_run:
|
||||
print(f" REMOVED graph_id={graph_id[:30]}...")
|
||||
else:
|
||||
col.update_one(
|
||||
{"graph_id": graph_id},
|
||||
{"$set": {
|
||||
"permanently_deleted": True,
|
||||
"permanently_deleted_at": datetime.now(timezone.utc).replace(tzinfo=None),
|
||||
}},
|
||||
)
|
||||
stats["removed"] += 1
|
||||
return
|
||||
|
||||
# 2) Nova nebo zmenena zprava — rozhodneme podle existence graph_id v Mongo
|
||||
graph_id = item.get("id")
|
||||
if not graph_id:
|
||||
return
|
||||
|
||||
existing = col.find_one({"graph_id": graph_id}, {"_id": 1})
|
||||
|
||||
if existing:
|
||||
# Existujici zprava — update jen sync poli (delta payload je obsahuje)
|
||||
fields = extract_sync_fields(item, folder_path)
|
||||
if dry_run:
|
||||
print(f" SYNC {item.get('subject','')[:60]}")
|
||||
else:
|
||||
col.update_one({"_id": existing["_id"]}, {"$set": fields})
|
||||
stats["sync"] += 1
|
||||
else:
|
||||
# Nova zprava — pro telo+attachments+headers fetchneme plnou verzi
|
||||
full = fetch_full_message(mailbox, graph_id)
|
||||
if full is None:
|
||||
stats["errors"] += 1
|
||||
return
|
||||
doc = extract_message(full, folder_path)
|
||||
if doc is None:
|
||||
stats["errors"] += 1
|
||||
return
|
||||
if dry_run:
|
||||
print(f" NEW {doc.get('subject','')[:60]}")
|
||||
else:
|
||||
col.update_one({"_id": doc["_id"]}, {"$set": doc}, upsert=True)
|
||||
stats["new"] += 1
|
||||
|
||||
|
||||
# ─── Indexy pro sync_state ────────────────────────────────────────────────────
|
||||
|
||||
def ensure_sync_state_indexes(sync_col):
|
||||
sync_col.create_index([("mailbox", ASCENDING), ("folder_id", ASCENDING)])
|
||||
sync_col.create_index([("last_run_at", ASCENDING)])
|
||||
|
||||
|
||||
def ensure_perm_deleted_index(col):
|
||||
col.create_index([("permanently_deleted", ASCENDING)], sparse=True)
|
||||
|
||||
|
||||
# ─── Main ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
def discover_mailboxes(db) -> list[str]:
|
||||
"""Vrati seznam mailboxu = vsechny kolekce v `emaily` mimo NON_MAILBOX_COLLECTIONS
|
||||
a SKIP_MAILBOXES."""
|
||||
out = []
|
||||
for name in sorted(db.list_collection_names()):
|
||||
if name in NON_MAILBOX_COLLECTIONS:
|
||||
continue
|
||||
if name in SKIP_MAILBOXES:
|
||||
print(f" [skip] {name} — v SKIP_MAILBOXES (neni Graph pristup)")
|
||||
continue
|
||||
out.append(name)
|
||||
return out
|
||||
|
||||
|
||||
def sync_mailbox(client, mailbox: str, args) -> dict:
|
||||
"""Sync jedne schranky. Vraci totals dict."""
|
||||
_v14.GRAPH_MAILBOX = mailbox
|
||||
|
||||
print(f"\n========== {mailbox} ==========")
|
||||
|
||||
col = client[MONGO_DB][mailbox]
|
||||
sync_col = client[MONGO_DB][SYNC_STATE_COL]
|
||||
|
||||
if not args.dry_run:
|
||||
ensure_sync_state_indexes(sync_col)
|
||||
ensure_perm_deleted_index(col)
|
||||
|
||||
if args.reset:
|
||||
n = sync_col.delete_many({"mailbox": mailbox}).deleted_count
|
||||
print(f" --reset: smazano {n} deltaLinku pro {mailbox}")
|
||||
|
||||
print("Nacitam seznam slozek...")
|
||||
try:
|
||||
folders = get_all_folders(mailbox)
|
||||
except requests.HTTPError as e:
|
||||
print(f" CHYBA: nelze nacist slozky pro {mailbox}: {e}")
|
||||
logging.error("get_all_folders %s: %s", mailbox, e)
|
||||
return {"new": 0, "sync": 0, "removed": 0, "errors": 1}
|
||||
|
||||
if args.folder:
|
||||
folders = [f for f in folders if args.folder.lower() in f["path"].lower()]
|
||||
print(f" Slozek ke zpracovani: {len(folders)}")
|
||||
|
||||
totals = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
|
||||
for folder in folders:
|
||||
s = sync_folder(col, sync_col, mailbox, folder, args.dry_run, args.limit)
|
||||
for k in totals:
|
||||
totals[k] += s[k]
|
||||
print(f" -> mailbox total: new={totals['new']} sync={totals['sync']} removed={totals['removed']} err={totals['errors']}")
|
||||
return totals
|
||||
|
||||
|
||||
def main():
|
||||
ap = argparse.ArgumentParser(description=f"parse_emails_graph delta sync v{SCRIPT_VERSION}")
|
||||
ap.add_argument("--mailbox", default="",
|
||||
help="E-mail schranky (= kolekce v Mongo). "
|
||||
"Bez argumentu projede vsechny schranky z `emaily` (mimo SKIP_MAILBOXES).")
|
||||
ap.add_argument("--folder", default="", help="Filtruje slozky obsahujici tento retezec (default: vsechny)")
|
||||
ap.add_argument("--limit", type=int, default=0, help="Max polozek na slozku (test)")
|
||||
ap.add_argument("--reset", action="store_true",
|
||||
help="Smaze deltaLinky pro vybrane schranky — pristi beh zacne od fresh delta")
|
||||
ap.add_argument("--dry-run", action="store_true", help="Nic neulozi do Mongo, jen vypise co by se stalo")
|
||||
args = ap.parse_args()
|
||||
|
||||
print(f"=== Delta sync v{SCRIPT_VERSION} ===")
|
||||
if args.dry_run:
|
||||
print(" DRY-RUN — zadne zmeny v Mongo")
|
||||
|
||||
print("Pripojuji se k MongoDB...")
|
||||
client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
|
||||
client.admin.command("ping")
|
||||
db = client[MONGO_DB]
|
||||
|
||||
if args.mailbox:
|
||||
if args.mailbox in SKIP_MAILBOXES:
|
||||
print(f" CHYBA: {args.mailbox} je v SKIP_MAILBOXES — neni Graph pristup.")
|
||||
sys.exit(2)
|
||||
mailboxes = [args.mailbox]
|
||||
else:
|
||||
mailboxes = discover_mailboxes(db)
|
||||
print(f" Schranky ke zpracovani: {len(mailboxes)}")
|
||||
for m in mailboxes:
|
||||
print(f" {m}")
|
||||
|
||||
print("Token Graph API...")
|
||||
get_token()
|
||||
print(" OK")
|
||||
|
||||
t0 = time.time()
|
||||
grand = {"new": 0, "sync": 0, "removed": 0, "errors": 0}
|
||||
per_mailbox = []
|
||||
for mb in mailboxes:
|
||||
try:
|
||||
s = sync_mailbox(client, mb, args)
|
||||
except Exception as e:
|
||||
print(f" FATAL pri sync {mb}: {e}")
|
||||
logging.error("sync_mailbox %s: %s", mb, e)
|
||||
s = {"new": 0, "sync": 0, "removed": 0, "errors": 1}
|
||||
per_mailbox.append((mb, s))
|
||||
for k in grand:
|
||||
grand[k] += s[k]
|
||||
|
||||
dt = time.time() - t0
|
||||
print(f"\n=== SHRNUTI ===")
|
||||
for mb, s in per_mailbox:
|
||||
print(f" {mb:40} new={s['new']:>5} sync={s['sync']:>5} removed={s['removed']:>4} err={s['errors']:>3}")
|
||||
print(f" {'TOTAL':40} new={grand['new']:>5} sync={grand['sync']:>5} removed={grand['removed']:>4} err={grand['errors']:>3}")
|
||||
print(f" trvalo: {dt:.1f} s")
|
||||
return 1 if grand["errors"] > 0 else 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main() or 0)
|
||||
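`graph_get` above reads `Retry-After` as a whole number of seconds. HTTP also allows that header to carry a date, so a more tolerant parser may be useful. The following is a minimal sketch, not part of the script: `retry_after_seconds`, its 5-second default and the `now` parameter (added to make the function testable) are assumptions introduced here.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        default: float = 5.0,
                        now: Optional[datetime] = None) -> float:
    """Return how long to wait, given a Retry-After header value.

    Accepts either delay-seconds ("120") or an HTTP date
    ("Wed, 21 Oct 2015 07:28:00 GMT"). Falls back to `default`
    when the header is missing or cannot be parsed.
    """
    if not header_value:
        return default
    value = header_value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # older Python versions raise TypeError, newer ones ValueError
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # never return a negative wait if the date is already in the past
    return max(0.0, (when - now).total_seconds())


print(retry_after_seconds("120"))
```

In the script this would replace the `int(...)` conversion inside the 429 branch; whether Graph ever sends the date form for this endpoint is not established here.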
@@ -0,0 +1,579 @@
|
"""
==============================================================================
Script:  enrich_fulltext_emails_v1.3.py
Version: 1.3
Date:    2026-06-04
Author:  vladimir.buzalka

Description:
  Extracts the full text of e-mails stored in MongoDB (db: emaily) and stores
  it in PostgreSQL (db: MongoEmaily, table: emails) with a GIN tsvector index.

  E-mails are NOT downloaded again; the bodies are already in Mongo from
  parse_emails_graph_v1.4 (and refetch_text_bodies_v1.0 for old plain-text
  e-mails). This script only picks the first available body and sends the
  text to PG for full-text search.

Changes in v1.3.1 (2026-06-09):
  - Bugfix: _clean_for_pg replaces lone surrogates (\\ud800-\\udfff) with U+FFFD.
    Previously a single mail with surrogates (e.g. a JNJ .msg) brought down the
    whole batch and step 5 ended with FAIL. EXTRACTOR_VERSION stays 1.2 (no
    change to the fallback logic).

Changes in v1.3 vs v1.2:
  - Bugfix: NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
    (sync_state was added by the delta sync; v1.2 treated it as a mailbox).
  - --index-reset: before processing a mailbox, deletes all of its e-mails
    from PG (forced re-extract; use it when you bump EXTRACTOR_VERSION or
    want a clean slate).
  - Improved per-mailbox header: shows the counts in Mongo, in PG, and to be
    processed.

Changes in v1.2 vs v1.1:
  - S/MIME e-mails: if unwrap_smime_v1.0 stored smime_body_text/smime_body_html,
    it is PREFERRED over the meaningless wrapper body.
  - body_source: new value "smime".
  - EXTRACTOR_VERSION=1.2 -> all existing e-mails in PG are re-parsed.

Changes in v1.1 vs v1.0:
  - Fallback order extended with body_text.
  - body_source supports the new value "text" (full plain-text body, max 2 MB).

Source:
  MongoDB 192.168.1.76 db=emaily collection=<mailbox>
  (except NON_MAILBOX_COLLECTIONS)

Target:
  PostgreSQL 192.168.1.76 db=MongoEmaily table=emails
  tsvector config 'soubory' (shared: simple + unaccent)

Incrementality:
  If (mailbox, message_id) already exists, extractor_version is current, and
  modified_at in Mongo is not newer -> skip. When the extractor version
  changes, everything is re-parsed. --index-reset bypasses this and clears PG
  before the run.

Usage:
  python enrich_fulltext_emails_v1.3.py                              # all mailboxes
  python enrich_fulltext_emails_v1.3.py --mailbox ordinace@buzalkova.cz
  python enrich_fulltext_emails_v1.3.py --limit 500                  # test
  python enrich_fulltext_emails_v1.3.py --mailbox X --index-reset    # clears the mailbox in PG and re-extracts everything
  python enrich_fulltext_emails_v1.3.py --index-reset                # clears the WHOLE index and rebuilds it (SLOW!)
==============================================================================
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import traceback
|
||||
from datetime import datetime, timezone
|
||||
from typing import Optional
|
||||
|
||||
import psycopg
|
||||
from bs4 import BeautifulSoup
|
||||
from pymongo import MongoClient
|
||||
|
||||
# --- configuration ----------------------------------------------------------
|
||||
MONGO_URI = "mongodb://192.168.1.76:27017"
|
||||
MONGO_DB = "emaily"
|
||||
|
||||
PG_DSN = ("host=192.168.1.76 port=5432 dbname=MongoEmaily "
|
||||
"user=vladimir.buzalka password=Vlado7309208104++")
|
||||
|
||||
EXTRACTOR_VERSION = "1.2"   # DO NOT change unless you change the fallback logic!
|
||||
|
||||
MAX_TEXT_BYTES = 5 * 1024 * 1024 # plain text max 5 MB
|
||||
|
||||
# Collections in `emaily` that are NOT mailboxes (not processed)
|
||||
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
|
||||
|
||||
BATCH_SIZE = 100
|
||||
|
||||
|
||||
# --- SCHEMA -----------------------------------------------------------------
|
||||
|
||||
SCHEMA_SQL = """
|
||||
CREATE EXTENSION IF NOT EXISTS unaccent;
|
||||
CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
||||
|
||||
DO $$
|
||||
BEGIN
|
||||
IF NOT EXISTS (SELECT 1 FROM pg_ts_config WHERE cfgname = 'soubory') THEN
|
||||
CREATE TEXT SEARCH CONFIGURATION soubory ( COPY = simple );
|
||||
ALTER TEXT SEARCH CONFIGURATION soubory
|
||||
ALTER MAPPING FOR hword, hword_part, word
|
||||
WITH unaccent, simple;
|
||||
END IF;
|
||||
END$$;
|
||||
|
||||
CREATE TABLE IF NOT EXISTS emails (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
mailbox TEXT NOT NULL,
|
||||
message_id TEXT NOT NULL,
|
||||
graph_id TEXT,
|
||||
conversation_id TEXT,
|
||||
folder_path TEXT,
|
||||
subject TEXT,
|
||||
sender_email TEXT,
|
||||
sender_name TEXT,
|
||||
to_addrs TEXT,
|
||||
cc_addrs TEXT,
|
||||
bcc_addrs TEXT,
|
||||
sent_at TIMESTAMPTZ,
|
||||
received_at TIMESTAMPTZ,
|
||||
modified_at TIMESTAMPTZ,
|
||||
is_read BOOLEAN,
|
||||
is_draft BOOLEAN,
|
||||
has_attachments BOOLEAN,
|
||||
attachment_count INT,
|
||||
attachments_summary TEXT,
|
||||
body TEXT,
|
||||
body_length INT,
|
||||
body_source TEXT, -- 'smime' | 'html' | 'text' | 'preview' | 'empty'
|
||||
tsv tsvector GENERATED ALWAYS AS (
|
||||
to_tsvector('soubory'::regconfig,
|
||||
left(
|
||||
coalesce(subject, '') || ' ' ||
|
||||
coalesce(sender_email, '') || ' ' ||
|
||||
coalesce(sender_name, '') || ' ' ||
|
||||
coalesce(to_addrs, '') || ' ' ||
|
||||
coalesce(cc_addrs, '') || ' ' ||
|
||||
coalesce(attachments_summary, '') || ' ' ||
|
||||
coalesce(body, ''),
|
||||
800000)
|
||||
)
|
||||
) STORED,
|
||||
extracted_at TIMESTAMPTZ DEFAULT now(),
|
||||
extractor_version TEXT,
|
||||
ok BOOLEAN,
|
||||
error TEXT,
|
||||
UNIQUE (mailbox, message_id)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS emails_tsv_gin ON emails USING gin(tsv);
|
||||
CREATE INDEX IF NOT EXISTS emails_subject_trgm ON emails USING gin(subject gin_trgm_ops);
|
||||
CREATE INDEX IF NOT EXISTS emails_sender_email_idx ON emails(sender_email);
|
||||
CREATE INDEX IF NOT EXISTS emails_mailbox_idx ON emails(mailbox);
|
||||
CREATE INDEX IF NOT EXISTS emails_received_idx ON emails(received_at DESC);
|
||||
CREATE INDEX IF NOT EXISTS emails_conv_idx ON emails(conversation_id);
|
||||
"""
|
||||
|
||||
|
||||
# --- HELPERS ----------------------------------------------------------------
|
||||
|
||||
_CTRL_RX = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
|
||||
_WS_RX = re.compile(r"[ \t]+")
|
||||
_NL_RX = re.compile(r"\n{3,}")
|
||||
# Lone surrogates (\ud800-\udfff) are invalid in UTF-8 -> on write psycopg raises
# UnicodeEncodeError ("surrogates not allowed") and brings down the whole batch.
# They come from badly decoded bodies (e.g. some JNJ .msg). Replace them with U+FFFD.
|
||||
_SURROGATE_RX = re.compile(r"[\ud800-\udfff]")
|
||||
|
||||
|
||||
def _clean_for_pg(s: str) -> str:
    if not s:
        return ""
    s = _CTRL_RX.sub("", s)
    if _SURROGATE_RX.search(s):
        s = _SURROGATE_RX.sub("\ufffd", s)
    return s
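

# Illustrative sketch, not part of the original script: it shows why a lone
# surrogate has to be replaced before the text is handed to the database driver.
# The _demo_* names are hypothetical; the pattern mirrors _SURROGATE_RX above.
import re as _demo_re

_DEMO_SURROGATE_RX = _demo_re.compile(r"[\ud800-\udfff]")


def _demo_is_utf8_encodable(s: str) -> bool:
    """Return True when s can be encoded as UTF-8 without an error."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def _demo_strip_surrogates(s: str) -> str:
    """Replace every lone surrogate code point with U+FFFD."""
    return _DEMO_SURROGATE_RX.sub("\ufffd", s)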
|
||||
|
||||
|
||||
def _truncate(s: str) -> str:
    s = _clean_for_pg(s or "")
    if not s:
        return ""
    b = s.encode("utf-8", errors="replace")
    if len(b) <= MAX_TEXT_BYTES:
        return s
    return b[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")
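

# Illustrative sketch, not part of the original script: the same byte-budget
# truncation as _truncate above, with the limit passed in so that the effect is
# visible on a short string. Cutting the encoded bytes may split a multi-byte
# character; decoding with errors="ignore" drops the incomplete tail.
# _demo_truncate_utf8 is a hypothetical name.
def _demo_truncate_utf8(s: str, max_bytes: int) -> str:
    b = s.encode("utf-8", errors="replace")
    if len(b) <= max_bytes:
        return s
    return b[:max_bytes].decode("utf-8", errors="ignore")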
|
||||
|
||||
|
||||
def html_to_text(html: str) -> str:
    if not html:
        return ""
    try:
        soup = BeautifulSoup(html, "lxml")
    except Exception:
        soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "head"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    lines = [_WS_RX.sub(" ", ln).strip() for ln in text.split("\n")]
    text = "\n".join(ln for ln in lines if ln)
    text = _NL_RX.sub("\n\n", text)
    return text
|
||||
|
||||
|
||||
def fmt_recipients(recipients: list, kind: str) -> str:
    if not recipients:
        return ""
    out = []
    for r in recipients:
        if not isinstance(r, dict):
            continue
        if r.get("type") != kind:
            continue
        name = (r.get("name") or "").strip()
        email = (r.get("email") or "").strip()
        if name and email:
            out.append(f"{name} <{email}>")
        elif email:
            out.append(email)
        elif name:
            out.append(name)
    return "; ".join(out)
|
||||
|
||||
|
||||
def fmt_attachments(attachments: list) -> str:
    if not attachments:
        return ""
    out = []
    for a in attachments[:20]:
        if not isinstance(a, dict):
            continue
        name = a.get("name") or a.get("filename") or ""
        if name:
            out.append(name)
    return " | ".join(out)
|
||||
|
||||
|
||||
def _short(s, n=60):
    if not s:
        return ""
    s = str(s).replace("\n", " ").strip()
    return s if len(s) <= n else s[:n] + "..."
|
||||
|
||||
|
||||
def _now() -> datetime:
    return datetime.now(tz=timezone.utc)
|
||||
|
||||
|
||||
def _aware_utc(dt: Optional[datetime]) -> Optional[datetime]:
    """Normalization: PG TIMESTAMPTZ -> tz-aware UTC; Mongo datetime -> naive (UTC).
    Returns a tz-aware UTC datetime or None."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
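

# Illustrative sketch, not part of the original script: the two cases handled by
# _aware_utc above. A naive value is only tagged as UTC (the wall-clock time is
# kept), while an aware value is converted to UTC. _demo_aware_utc is a
# hypothetical stand-alone copy so the snippet runs on its own.
from datetime import datetime, timezone


def _demo_aware_utc(dt):
    if dt is None:
        return None
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)   # naive: assume it already is UTC
    return dt.astimezone(timezone.utc)           # aware: shift to UTC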
|
||||
|
||||
|
||||
# --- MAIN LOOP --------------------------------------------------------------
|
||||
|
||||
def process_mailbox(pg: psycopg.Connection, mongo_coll, mailbox: str,
                    limit: Optional[int] = None,
                    index_reset: bool = False) -> dict:
    # --index-reset: delete everything for this mailbox in PG
    if index_reset:
        with pg.cursor() as cur:
            cur.execute("DELETE FROM emails WHERE mailbox = %s", (mailbox,))
            deleted = cur.rowcount
        pg.commit()
        print(f"[{mailbox}] --index-reset: smazano {deleted} radku v PG")

    # existing rows in PG (fast incremental lookup)
    # tuple = (extractor_version, ok, body_source)
    with pg.cursor() as cur:
        cur.execute(
            "SELECT message_id, extractor_version, ok, body_source "
            "FROM emails WHERE mailbox = %s",
            (mailbox,),
        )
        existing = {row[0]: (row[1], row[2], row[3]) for row in cur.fetchall()}

    mongo_total = mongo_coll.estimated_document_count()
    pg_total = len(existing)
    pg_uptodate = sum(1 for v in existing.values()
                      if v[0] == EXTRACTOR_VERSION and v[1])
    to_process_estimate = mongo_total - pg_uptodate
    print(f"\n========== {mailbox} ==========")
    print(f" v Mongu: {mongo_total}")
    print(f" v PG: {pg_total} (z toho ext_v={EXTRACTOR_VERSION} & ok=true: {pg_uptodate})")
    print(f" k zpracovani: ~{to_process_estimate}{' (limit=' + str(limit) + ')' if limit else ''}")

    if to_process_estimate <= 0 and not index_reset and not limit:
        print(" Nic noveho ke zpracovani.")
        return {"mailbox": mailbox, "processed": 0, "ok": 0, "errors": 0,
                "skipped": pg_uptodate, "empty_body": 0}

    proj = {
        "_id": 1, "graph_id": 1, "conversation_id": 1, "folder_path": 1,
        "subject": 1, "sender": 1, "recipients": 1,
        "sent_at": 1, "received_at": 1, "modified_at": 1,
        "is_read": 1, "is_draft": 1,
        "has_attachments": 1, "attachment_count": 1, "attachments": 1,
        "body_html": 1, "body_text": 1, "body_preview": 1,
        "smime_unwrapped": 1, "smime_body_text": 1, "smime_body_html": 1,
        "smime_subject": 1, "smime_inner_attachments": 1,
    }
    cursor = mongo_coll.find({}, proj, no_cursor_timeout=True)
    if limit:
        cursor = cursor.limit(limit)

    processed = ok = errors = skipped = empty_body = 0
    queue: list[dict] = []
    n = 0

    try:
        for doc in cursor:
            n += 1
            msg_id = doc.get("_id") or ""
            prev = existing.get(msg_id)  # (extractor_version, ok, body_source)
            mongo_mtime = doc.get("modified_at")

            # Skip when PG has the same extractor version and ok=true.
            # Exception: smime_unwrapped is set in Mongo but PG body_source != 'smime'
            # -> unwrap_smime added the unwrapped text after the enrich run -> re-enrich.
            if prev and prev[0] == EXTRACTOR_VERSION and prev[1]:
                needs_smime_reindex = (
                    bool(doc.get("smime_unwrapped"))
                    and prev[2] != "smime"
                )
                if not needs_smime_reindex:
                    skipped += 1
                    continue

            sender = doc.get("sender") or {}
            recipients = doc.get("recipients") or []
            attachments = doc.get("attachments") or []
            inner = doc.get("smime_inner_attachments") or []
            if inner:
                attachments = list(attachments) + [
                    {"filename": (a.get("filename") or "") + " [smime]"}
                    for a in inner if a.get("filename")
                ]

            row = {
                "mailbox": mailbox,
                "message_id": msg_id,
                "graph_id": doc.get("graph_id"),
                "conversation_id": doc.get("conversation_id"),
                "folder_path": doc.get("folder_path"),
                "subject": doc.get("subject") or "",
                "sender_email": sender.get("email"),
                "sender_name": sender.get("name"),
                "to_addrs": fmt_recipients(recipients, "to"),
                "cc_addrs": fmt_recipients(recipients, "cc"),
                "bcc_addrs": fmt_recipients(recipients, "bcc"),
                # All timestamps from Mongo are naive but are interpreted as UTC.
                # Tag them as tz-aware so that PG TIMESTAMPTZ stores the correct UTC
                # value and does not apply an offset based on the session timezone.
                "sent_at": _aware_utc(doc.get("sent_at")),
                "received_at": _aware_utc(doc.get("received_at")),
                "modified_at": _aware_utc(mongo_mtime),
                "is_read": doc.get("is_read"),
                "is_draft": doc.get("is_draft"),
                "has_attachments": doc.get("has_attachments"),
                "attachment_count": doc.get("attachment_count"),
                "attachments_summary": fmt_attachments(attachments),
                "body": None,
                "body_length": 0,
                "body_source": "empty",
                "extracted_at": _now(),
                "extractor_version": EXTRACTOR_VERSION,
                "ok": False,
                "error": None,
            }

            status = "OK "
            detail = ""
            try:
                text = ""
                if doc.get("smime_unwrapped"):
                    s_text = doc.get("smime_body_text") or ""
                    s_html = doc.get("smime_body_html") or ""
                    s_html_text = html_to_text(s_html) if s_html else ""
                    combined = "\n\n".join(p for p in (s_text, s_html_text) if p)
                    s_subject = doc.get("smime_subject") or ""
                    if s_subject:
                        combined = f"Subject: {s_subject}\n\n{combined}"
                    if combined:
                        text = combined
                        row["body_source"] = "smime"
                if not text:
                    html = doc.get("body_html") or ""
                    h_text = html_to_text(html) if html else ""
                    if h_text:
                        text = h_text
                        row["body_source"] = "html"
                if not text:
                    plain = doc.get("body_text") or ""
                    if plain:
                        text = plain
                        row["body_source"] = "text"
                if not text:
                    preview = doc.get("body_preview") or ""
                    if preview:
                        text = preview
                        row["body_source"] = "preview"
                if not text:
                    row["body_source"] = "empty"
                    empty_body += 1
                body = _truncate(text)
                row["body"] = body if body else None
                row["body_length"] = len(body)
                row["ok"] = True
                ok += 1
                detail = f"{len(body)} znaku {_short(body, 60)!r}"
            except Exception as e:
                row["error"] = f"{type(e).__name__}: {e}"[:500]
                status = "ERR"
                detail = row["error"][:80]
                errors += 1

            queue.append(row)
            processed += 1

            if processed % 200 == 0 or processed == 1:
                subj = _short(row["subject"], 50)
                print(f" [{n:>6}|p={processed:>5}] {status} {row['body_source']:<7} "
                      f"{row['body_length']:>7}ch | {subj}", flush=True)

            if len(queue) >= BATCH_SIZE:
                _flush(pg, queue)
                queue.clear()
    finally:
        cursor.close()

    if queue:
        _flush(pg, queue)

    return {"mailbox": mailbox, "processed": processed, "ok": ok,
            "errors": errors, "skipped": skipped, "empty_body": empty_body}
|
||||
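

# Illustrative sketch, not part of the original script: the body fallback order
# used in process_mailbox (smime -> html -> text -> preview -> empty) written as a
# pure function. _demo_pick_body is a hypothetical name, and the HTML converter
# is injected as a parameter (identity by default) so that the snippet runs
# without BeautifulSoup; the real code calls html_to_text.
from typing import Callable, Tuple


def _demo_pick_body(doc: dict,
                    to_text: Callable[[str], str] = lambda h: h) -> Tuple[str, str]:
    """Return (text, body_source) for one Mongo email document."""
    if doc.get("smime_unwrapped"):
        s_text = doc.get("smime_body_text") or ""
        s_html = doc.get("smime_body_html") or ""
        parts = (s_text, to_text(s_html) if s_html else "")
        combined = "\n\n".join(p for p in parts if p)
        s_subject = doc.get("smime_subject") or ""
        if s_subject:
            combined = f"Subject: {s_subject}\n\n{combined}"
        if combined:
            return combined, "smime"
    html = doc.get("body_html") or ""
    h_text = to_text(html) if html else ""
    if h_text:
        return h_text, "html"
    if doc.get("body_text"):
        return doc["body_text"], "text"
    if doc.get("body_preview"):
        return doc["body_preview"], "preview"
    return "", "empty"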
|
||||
|
||||
UPSERT_SQL = """
|
||||
INSERT INTO emails
|
||||
(mailbox, message_id, graph_id, conversation_id, folder_path,
|
||||
subject, sender_email, sender_name, to_addrs, cc_addrs, bcc_addrs,
|
||||
sent_at, received_at, modified_at, is_read, is_draft,
|
||||
has_attachments, attachment_count, attachments_summary,
|
||||
body, body_length, body_source,
|
||||
extracted_at, extractor_version, ok, error)
|
||||
VALUES
|
||||
(%(mailbox)s, %(message_id)s, %(graph_id)s, %(conversation_id)s, %(folder_path)s,
|
||||
%(subject)s, %(sender_email)s, %(sender_name)s, %(to_addrs)s, %(cc_addrs)s, %(bcc_addrs)s,
|
||||
%(sent_at)s, %(received_at)s, %(modified_at)s, %(is_read)s, %(is_draft)s,
|
||||
%(has_attachments)s, %(attachment_count)s, %(attachments_summary)s,
|
||||
%(body)s, %(body_length)s, %(body_source)s,
|
||||
%(extracted_at)s, %(extractor_version)s, %(ok)s, %(error)s)
|
||||
ON CONFLICT (mailbox, message_id) DO UPDATE SET
|
||||
graph_id = EXCLUDED.graph_id,
|
||||
conversation_id = EXCLUDED.conversation_id,
|
||||
folder_path = EXCLUDED.folder_path,
|
||||
subject = EXCLUDED.subject,
|
||||
sender_email = EXCLUDED.sender_email,
|
||||
sender_name = EXCLUDED.sender_name,
|
||||
to_addrs = EXCLUDED.to_addrs,
|
||||
cc_addrs = EXCLUDED.cc_addrs,
|
||||
bcc_addrs = EXCLUDED.bcc_addrs,
|
||||
sent_at = EXCLUDED.sent_at,
|
||||
received_at = EXCLUDED.received_at,
|
||||
modified_at = EXCLUDED.modified_at,
|
||||
is_read = EXCLUDED.is_read,
|
||||
is_draft = EXCLUDED.is_draft,
|
||||
has_attachments = EXCLUDED.has_attachments,
|
||||
attachment_count = EXCLUDED.attachment_count,
|
||||
attachments_summary = EXCLUDED.attachments_summary,
|
||||
body = EXCLUDED.body,
|
||||
body_length = EXCLUDED.body_length,
|
||||
body_source = EXCLUDED.body_source,
|
||||
extracted_at = EXCLUDED.extracted_at,
|
||||
extractor_version = EXCLUDED.extractor_version,
|
||||
ok = EXCLUDED.ok,
|
||||
error = EXCLUDED.error
|
||||
"""
|
||||
|
||||
|
||||
def _flush(pg: psycopg.Connection, rows: list[dict]) -> None:
    for r in rows:
        for k in ("subject", "sender_email", "sender_name", "to_addrs", "cc_addrs",
                  "bcc_addrs", "attachments_summary", "body", "error", "folder_path"):
            if r.get(k):
                r[k] = _clean_for_pg(r[k])
    with pg.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    pg.commit()
|
||||
|
||||
|
||||
def discover_mailboxes(db) -> list[str]:
    out = []
    for name in sorted(db.list_collection_names()):
        if name in NON_MAILBOX_COLLECTIONS:
            continue
        out.append(name)
    return out
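

# Illustrative sketch, not part of the original script: the exclusion filter of
# discover_mailboxes exercised against a stand-in database object. _DemoDb and
# _demo_discover are hypothetical; the only method assumed is
# list_collection_names(), which the real pymongo Database also provides.
# Any helper collection missing from the excluded set is returned as a mailbox.
class _DemoDb:
    def __init__(self, names):
        self._names = list(names)

    def list_collection_names(self):
        return list(self._names)


def _demo_discover(db, excluded=frozenset({"attachments_index", "sync_state"})):
    return [n for n in sorted(db.list_collection_names()) if n not in excluded]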
|
||||
|
||||
|
||||
def main() -> int:
    ap = argparse.ArgumentParser(description="enrich_fulltext_emails v1.3")
    ap.add_argument("--mailbox", default="",
                    help="Jedna konkretni schranka. Bez argumentu projede vsechny.")
    ap.add_argument("--limit", type=int,
                    help="Limit emailu na schranku (test)")
    ap.add_argument("--index-reset", action="store_true",
                    help="Pred zpracovanim schranky vymaze vsechny jeji emaily z PG "
                         "(force re-extract). Bez --mailbox SMAZE CELY index.")
    args = ap.parse_args()

    t0 = time.time()
    print("=== enrich_fulltext_emails v1.3 ===")
    print(f"Start: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    print("\nPripojuji se k PostgreSQL...")
    pg = psycopg.connect(PG_DSN, connect_timeout=10)
    with pg.cursor() as cur:
        cur.execute(SCHEMA_SQL)
    pg.commit()
    print(" Schema OK.")

    print("Pripojuji se k MongoDB...")
    mongo = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    mongo.admin.command("ping")
    db = mongo[MONGO_DB]
    print(" MongoDB OK.")

    if args.mailbox:
        mailboxes = [args.mailbox]
    else:
        mailboxes = discover_mailboxes(db)
    print(f"\nSchranky ke zpracovani ({len(mailboxes)}):")
    for mb in mailboxes:
        print(f" - {mb}")

    if args.index_reset and not args.mailbox:
        print(f"\n!!! --index-reset bez --mailbox => SMAZE CELY INDEX ({len(mailboxes)} schranek) !!!")

    results = []
    for mb in mailboxes:
        try:
            results.append(process_mailbox(pg, db[mb], mb,
                                           limit=args.limit,
                                           index_reset=args.index_reset))
        except Exception as e:
            traceback.print_exc()
            print(f" FATAL pri zpracovani {mb}: {e}")
            results.append({"mailbox": mb, "processed": 0, "ok": 0,
                            "errors": 1, "skipped": 0, "empty_body": 0})

    pg.close()

    print("\n" + "="*60)
    print("=== SHRNUTI ===")
    grand = {"processed": 0, "ok": 0, "errors": 0, "skipped": 0, "empty_body": 0}
    for r in results:
        print(f" {r['mailbox']:40} processed={r['processed']:>5} ok={r['ok']:>5} "
              f"errors={r['errors']:>3} skipped={r['skipped']:>6} empty={r['empty_body']:>4}")
        for k in grand:
            grand[k] += r.get(k, 0)
    print(f" {'TOTAL':40} processed={grand['processed']:>5} ok={grand['ok']:>5} "
          f"errors={grand['errors']:>3} skipped={grand['skipped']:>6} empty={grand['empty_body']:>4}")
    print(f"\nCelkem trvalo: {time.time() - t0:.1f} s")
    print(f"Konec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    # exit code: 0 only when every mailbox finished without an error
    return 1 if grand["errors"] > 0 else 0
|
||||
|
||||
|
||||
if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("\nPreruseno uzivatelem")
    except Exception:
        traceback.print_exc()
        sys.exit(1)
|
||||
@@ -0,0 +1,587 @@
|
||||
"""
|
||||
==============================================================================
Script:  enrich_fulltext_emails_v1.4.py
Version: 1.4
Date:    2026-06-10
Author:  vladimir.buzalka

Changes in v1.4 (2026-06-10):
  - Bugfix: NON_MAILBOX_COLLECTIONS extended with "jnj_messages" and
    "jnj_sync_state" (helper collections for JNJ folder tracking). Previously
    discover_mailboxes treated them as mailboxes (different document schema) ->
    errors=1 -> the whole step 5 ended with FAIL(1) on every pipeline run.

Description:
  Extracts the full text of emails stored in MongoDB (db: emaily) and saves it to
  PostgreSQL (db: MongoEmaily, table: emails) with a GIN tsvector index.

  Emails are NOT downloaded again - the bodies are already in Mongo from
  parse_emails_graph_v1.4 (and refetch_text_bodies_v1.0 for old plain-text emails).
  This script only picks the first available body and sends the text to PG for
  full-text search.

Changes in v1.3.1 (2026-06-09):
  - Bugfix: _clean_for_pg replaces lone surrogates (\\ud800-\\udfff) with U+FFFD.
    Previously a single email with surrogates (e.g. a JNJ .msg) brought down the
    whole batch and step 5 ended in FAIL. EXTRACTOR_VERSION stays at 1.2 (the
    fallback logic did not change).

Changes in v1.3 vs v1.2:
  - Bugfix: NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state"}
    (sync_state was added by the delta sync; v1.2 treated it as a mailbox).
  - --index-reset: before processing a mailbox, deletes all of its emails from PG
    (force re-extract; use it after bumping EXTRACTOR_VERSION or for a clean rebuild).
  - Improved per-mailbox header: shows the counts in Mongo, in PG and still to process.

Changes in v1.2 vs v1.1:
  - S/MIME emails: if unwrap_smime_v1.0 stored smime_body_text/smime_body_html,
    it is PREFERRED over the meaningless wrapper body.
  - body_source: new value "smime".
  - EXTRACTOR_VERSION=1.2 -> all existing emails in PG are re-parsed.

Changes in v1.1 vs v1.0:
  - Fallback order extended with body_text.
  - body_source supports the new value "text" (full plain-text body, max 2 MB).

Source:
  MongoDB 192.168.1.76 db=emaily collection=<mailbox>
  (except NON_MAILBOX_COLLECTIONS)

Target:
  PostgreSQL 192.168.1.76 db=MongoEmaily table=emails
  tsvector config 'soubory' (shared - simple + unaccent)

Incrementality:
  If (mailbox, message_id) already exists, extractor_version is current
  and modified_at in Mongo is not newer -> skip. When the extractor version
  changes, everything is re-parsed. --index-reset bypasses this and clears PG
  before the run.

Usage:
  python enrich_fulltext_emails_v1.4.py                          # all mailboxes
  python enrich_fulltext_emails_v1.4.py --mailbox ordinace@buzalkova.cz
  python enrich_fulltext_emails_v1.4.py --limit 500              # test
  python enrich_fulltext_emails_v1.4.py --mailbox X --index-reset  # clears the mailbox in PG and re-extracts everything
  python enrich_fulltext_emails_v1.4.py --index-reset            # deletes the WHOLE index and rebuilds it (SLOW!)
==============================================================================
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import traceback
|
||||
from datetime import datetime, timezone
|
||||
from typing import Optional
|
||||
|
||||
import psycopg
|
||||
from bs4 import BeautifulSoup
|
||||
from pymongo import MongoClient
|
||||
|
||||
# --- configuration ----------------------------------------------------------
|
||||
MONGO_URI = "mongodb://192.168.1.76:27017"
|
||||
MONGO_DB = "emaily"
|
||||
|
||||
PG_DSN = ("host=192.168.1.76 port=5432 dbname=MongoEmaily "
|
||||
"user=vladimir.buzalka password=Vlado7309208104++")
|
||||
|
||||
EXTRACTOR_VERSION = "1.2"   # DO NOT change unless you change the fallback logic!
|
||||
|
||||
MAX_TEXT_BYTES = 5 * 1024 * 1024 # plain text max 5 MB
|
||||
|
||||
# Collections in `emaily` that are NOT mailboxes (not processed)
# (jnj_messages + jnj_sync_state = helper collections for JNJ folder tracking)
|
||||
NON_MAILBOX_COLLECTIONS = {"attachments_index", "sync_state",
|
||||
"jnj_messages", "jnj_sync_state"}
|
||||
|
||||
BATCH_SIZE = 100
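

# Illustrative sketch, not part of the original script: the buffering pattern
# that BATCH_SIZE controls. Rows are collected in a queue and handed over once
# the queue reaches the batch size; the remainder is handed over at the end.
# _demo_batched is a hypothetical generator; the real script calls _flush on
# the queue instead of yielding it.
from typing import Iterable, Iterator, List


def _demo_batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    queue: List[dict] = []
    for row in rows:
        queue.append(row)
        if len(queue) >= size:
            yield queue
            queue = []
    if queue:
        yield queue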
|
||||
|
||||
|
||||
# --- SCHEMA -----------------------------------------------------------------
|
||||
|
||||
SCHEMA_SQL = """
|
||||
CREATE EXTENSION IF NOT EXISTS unaccent;
|
||||
CREATE EXTENSION IF NOT EXISTS pg_trgm;
|
||||
|
||||
DO $$
|
||||
BEGIN
|
||||
IF NOT EXISTS (SELECT 1 FROM pg_ts_config WHERE cfgname = 'soubory') THEN
|
||||
CREATE TEXT SEARCH CONFIGURATION soubory ( COPY = simple );
|
||||
ALTER TEXT SEARCH CONFIGURATION soubory
|
||||
ALTER MAPPING FOR hword, hword_part, word
|
||||
WITH unaccent, simple;
|
||||
END IF;
|
||||
END$$;
|
||||
|
||||
CREATE TABLE IF NOT EXISTS emails (
|
||||
id BIGSERIAL PRIMARY KEY,
|
||||
mailbox TEXT NOT NULL,
|
||||
message_id TEXT NOT NULL,
|
||||
graph_id TEXT,
|
||||
conversation_id TEXT,
|
||||
folder_path TEXT,
|
||||
subject TEXT,
|
||||
sender_email TEXT,
|
||||
sender_name TEXT,
|
||||
to_addrs TEXT,
|
||||
cc_addrs TEXT,
|
||||
bcc_addrs TEXT,
|
||||
sent_at TIMESTAMPTZ,
|
||||
received_at TIMESTAMPTZ,
|
||||
modified_at TIMESTAMPTZ,
|
||||
is_read BOOLEAN,
|
||||
is_draft BOOLEAN,
|
||||
has_attachments BOOLEAN,
|
||||
attachment_count INT,
|
||||
attachments_summary TEXT,
|
||||
body TEXT,
|
||||
body_length INT,
|
||||
body_source TEXT, -- 'smime' | 'html' | 'text' | 'preview' | 'empty'
|
||||
tsv tsvector GENERATED ALWAYS AS (
|
||||
to_tsvector('soubory'::regconfig,
|
||||
left(
|
||||
coalesce(subject, '') || ' ' ||
|
||||
coalesce(sender_email, '') || ' ' ||
|
||||
coalesce(sender_name, '') || ' ' ||
|
||||
coalesce(to_addrs, '') || ' ' ||
|
||||
coalesce(cc_addrs, '') || ' ' ||
|
||||
coalesce(attachments_summary, '') || ' ' ||
|
||||
coalesce(body, ''),
|
||||
800000)
|
||||
)
|
||||
) STORED,
|
||||
extracted_at TIMESTAMPTZ DEFAULT now(),
|
||||
extractor_version TEXT,
|
||||
ok BOOLEAN,
|
||||
error TEXT,
|
||||
UNIQUE (mailbox, message_id)
|
||||
);
|
||||
|
||||
CREATE INDEX IF NOT EXISTS emails_tsv_gin ON emails USING gin(tsv);
|
||||
CREATE INDEX IF NOT EXISTS emails_subject_trgm ON emails USING gin(subject gin_trgm_ops);
|
||||
CREATE INDEX IF NOT EXISTS emails_sender_email_idx ON emails(sender_email);
|
||||
CREATE INDEX IF NOT EXISTS emails_mailbox_idx ON emails(mailbox);
|
||||
CREATE INDEX IF NOT EXISTS emails_received_idx ON emails(received_at DESC);
|
||||
CREATE INDEX IF NOT EXISTS emails_conv_idx ON emails(conversation_id);
|
||||
"""
|
||||
|
||||
|
||||
# --- HELPERS ----------------------------------------------------------------
|
||||
|
||||
_CTRL_RX = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")
|
||||
_WS_RX = re.compile(r"[ \t]+")
|
||||
_NL_RX = re.compile(r"\n{3,}")
|
||||
# Lone surrogates (\ud800-\udfff) are invalid in UTF-8 -> on write psycopg raises
# UnicodeEncodeError ("surrogates not allowed") and brings down the whole batch.
# They come from badly decoded bodies (e.g. some JNJ .msg). Replace them with U+FFFD.
|
||||
_SURROGATE_RX = re.compile(r"[\ud800-\udfff]")
|
||||
|
||||
|
||||
def _clean_for_pg(s: str) -> str:
    if not s:
        return ""
    s = _CTRL_RX.sub("", s)
    if _SURROGATE_RX.search(s):
        s = _SURROGATE_RX.sub("\ufffd", s)
    return s
|
||||
|
||||
|
||||
def _truncate(s: str) -> str:
    s = _clean_for_pg(s or "")
    if not s:
        return ""
    b = s.encode("utf-8", errors="replace")
    if len(b) <= MAX_TEXT_BYTES:
        return s
    return b[:MAX_TEXT_BYTES].decode("utf-8", errors="ignore")
|
||||
|
||||
|
||||
def html_to_text(html: str) -> str:
    if not html:
        return ""
    try:
        soup = BeautifulSoup(html, "lxml")
    except Exception:
        soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "head"]):
        tag.decompose()
    text = soup.get_text(separator="\n")
    lines = [_WS_RX.sub(" ", ln).strip() for ln in text.split("\n")]
    text = "\n".join(ln for ln in lines if ln)
    text = _NL_RX.sub("\n\n", text)
    return text
|
||||
|
||||
|
||||
def fmt_recipients(recipients: list, kind: str) -> str:
    if not recipients:
        return ""
    out = []
    for r in recipients:
        if not isinstance(r, dict):
            continue
        if r.get("type") != kind:
            continue
        name = (r.get("name") or "").strip()
        email = (r.get("email") or "").strip()
        if name and email:
            out.append(f"{name} <{email}>")
        elif email:
            out.append(email)
        elif name:
            out.append(name)
    return "; ".join(out)
|
||||
|
||||
|
||||
def fmt_attachments(attachments: list) -> str:
    if not attachments:
        return ""
    out = []
    for a in attachments[:20]:
        if not isinstance(a, dict):
            continue
        name = a.get("name") or a.get("filename") or ""
        if name:
            out.append(name)
    return " | ".join(out)
|
||||
|
||||
|
||||
def _short(s, n=60):
    if not s:
        return ""
    s = str(s).replace("\n", " ").strip()
    return s if len(s) <= n else s[:n] + "..."
|
||||
|
||||
|
||||
def _now() -> datetime:
    return datetime.now(tz=timezone.utc)
|
||||
|
||||
|
||||
def _aware_utc(dt: Optional[datetime]) -> Optional[datetime]:
    """Normalization: PG TIMESTAMPTZ -> tz-aware UTC; Mongo datetime -> naive (UTC).
    Returns a tz-aware UTC datetime or None."""
    if dt is None:
        return None
    if dt.tzinfo is None:
        return dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc)
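

# Illustrative sketch, not part of the original script: the incremental skip
# rule applied per document in process_mailbox, written as a pure function.
# _demo_should_skip is a hypothetical name. It mirrors only what the loop
# visibly checks (extractor version, ok flag, S/MIME exception); the
# modified_at comparison mentioned in the module docstring does not appear in
# that check, so it is left out here as well.
def _demo_should_skip(prev, doc: dict, current_version: str) -> bool:
    """prev is (extractor_version, ok, body_source) from PG, or None."""
    if not prev:
        return False                      # never indexed -> process
    version, ok, body_source = prev
    if version != current_version or not ok:
        return False                      # stale or failed row -> process again
    # S/MIME text that was unwrapped after the last enrich run forces a re-index.
    needs_smime_reindex = bool(doc.get("smime_unwrapped")) and body_source != "smime"
    return not needs_smime_reindex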
|
||||
|
||||
|
||||
# --- MAIN LOOP --------------------------------------------------------------
|
||||
|
||||
def process_mailbox(pg: psycopg.Connection, mongo_coll, mailbox: str,
|
||||
limit: Optional[int] = None,
|
||||
index_reset: bool = False) -> dict:
|
||||
# --index-reset: smaz vse pro tuto schranku v PG
|
||||
if index_reset:
|
||||
with pg.cursor() as cur:
|
||||
cur.execute("DELETE FROM emails WHERE mailbox = %s", (mailbox,))
|
||||
deleted = cur.rowcount
|
||||
pg.commit()
|
||||
print(f"[{mailbox}] --index-reset: smazano {deleted} radku v PG")
|
||||
|
||||
# existujici zaznamy v PG (rychly inkrementalni lookup)
|
||||
# tuple = (extractor_version, ok, body_source)
|
||||
with pg.cursor() as cur:
|
||||
cur.execute(
|
||||
"SELECT message_id, extractor_version, ok, body_source "
|
||||
"FROM emails WHERE mailbox = %s",
|
||||
(mailbox,),
|
||||
)
|
||||
existing = {row[0]: (row[1], row[2], row[3]) for row in cur.fetchall()}
|
||||
|
||||
mongo_total = mongo_coll.estimated_document_count()
|
||||
pg_total = len(existing)
|
||||
pg_uptodate = sum(1 for v in existing.values()
|
||||
if v[0] == EXTRACTOR_VERSION and v[1])
|
||||
to_process_estimate = mongo_total - pg_uptodate
|
||||
print(f"\n========== {mailbox} ==========")
|
||||
print(f" v Mongu: {mongo_total}")
|
||||
print(f" v PG: {pg_total} (z toho ext_v={EXTRACTOR_VERSION} & ok=true: {pg_uptodate})")
|
||||
print(f" k zpracovani: ~{to_process_estimate}{' (limit=' + str(limit) + ')' if limit else ''}")
|
||||
|
||||
if to_process_estimate <= 0 and not index_reset and not limit:
|
||||
print(" Nic noveho ke zpracovani.")
|
||||
return {"mailbox": mailbox, "processed": 0, "ok": 0, "errors": 0,
|
||||
"skipped": pg_uptodate, "empty_body": 0}
|
||||
|
||||
proj = {
|
||||
"_id": 1, "graph_id": 1, "conversation_id": 1, "folder_path": 1,
|
||||
"subject": 1, "sender": 1, "recipients": 1,
|
||||
"sent_at": 1, "received_at": 1, "modified_at": 1,
|
||||
"is_read": 1, "is_draft": 1,
|
||||
"has_attachments": 1, "attachment_count": 1, "attachments": 1,
|
||||
"body_html": 1, "body_text": 1, "body_preview": 1,
|
||||
"smime_unwrapped": 1, "smime_body_text": 1, "smime_body_html": 1,
|
||||
"smime_subject": 1, "smime_inner_attachments": 1,
|
||||
}
|
||||
cursor = mongo_coll.find({}, proj, no_cursor_timeout=True)
|
||||
if limit:
|
||||
cursor = cursor.limit(limit)
|
||||
|
||||
processed = ok = errors = skipped = empty_body = 0
|
||||
queue: list[dict] = []
|
||||
n = 0
|
||||
|
||||
    try:
        for doc in cursor:
            n += 1
            msg_id = doc.get("_id") or ""
            prev = existing.get(msg_id)  # (extractor_version, ok, body_source)
            mongo_mtime = doc.get("modified_at")

            # Skip when PG already has the same extractor version and ok=true.
            # Exception: smime_unwrapped is set in Mongo but PG body_source != 'smime'
            # -> unwrap_smime added the unwrapped text only after enrichment -> re-enrich.
            if prev and prev[0] == EXTRACTOR_VERSION and prev[1]:
                needs_smime_reindex = (
                    bool(doc.get("smime_unwrapped"))
                    and prev[2] != "smime"
                )
                if not needs_smime_reindex:
                    skipped += 1
                    continue

            sender = doc.get("sender") or {}
            recipients = doc.get("recipients") or []
            attachments = doc.get("attachments") or []
            inner = doc.get("smime_inner_attachments") or []
            if inner:
                attachments = list(attachments) + [
                    {"filename": (a.get("filename") or "") + " [smime]"}
                    for a in inner if a.get("filename")
                ]

            row = {
                "mailbox": mailbox,
                "message_id": msg_id,
                "graph_id": doc.get("graph_id"),
                "conversation_id": doc.get("conversation_id"),
                "folder_path": doc.get("folder_path"),
                "subject": doc.get("subject") or "",
                "sender_email": sender.get("email"),
                "sender_name": sender.get("name"),
                "to_addrs": fmt_recipients(recipients, "to"),
                "cc_addrs": fmt_recipients(recipients, "cc"),
                "bcc_addrs": fmt_recipients(recipients, "bcc"),
                # All Mongo timestamps are naive but are interpreted as UTC.
                # Tag them as tz-aware so that PG TIMESTAMPTZ stores the correct
                # UTC value instead of applying the session time zone offset.
                "sent_at": _aware_utc(doc.get("sent_at")),
                "received_at": _aware_utc(doc.get("received_at")),
                "modified_at": _aware_utc(mongo_mtime),
                "is_read": doc.get("is_read"),
                "is_draft": doc.get("is_draft"),
                "has_attachments": doc.get("has_attachments"),
                "attachment_count": doc.get("attachment_count"),
                "attachments_summary": fmt_attachments(attachments),
                "body": None,
                "body_length": 0,
                "body_source": "empty",
                "extracted_at": _now(),
                "extractor_version": EXTRACTOR_VERSION,
                "ok": False,
                "error": None,
            }

            status = "OK "
            detail = ""
            try:
                # Body source priority: S/MIME > HTML > plain text > preview.
                text = ""
                if doc.get("smime_unwrapped"):
                    s_text = doc.get("smime_body_text") or ""
                    s_html = doc.get("smime_body_html") or ""
                    s_html_text = html_to_text(s_html) if s_html else ""
                    combined = "\n\n".join(p for p in (s_text, s_html_text) if p)
                    s_subject = doc.get("smime_subject") or ""
                    if s_subject:
                        combined = f"Subject: {s_subject}\n\n{combined}"
                    if combined:
                        text = combined
                        row["body_source"] = "smime"
                if not text:
                    html = doc.get("body_html") or ""
                    h_text = html_to_text(html) if html else ""
                    if h_text:
                        text = h_text
                        row["body_source"] = "html"
                if not text:
                    plain = doc.get("body_text") or ""
                    if plain:
                        text = plain
                        row["body_source"] = "text"
                if not text:
                    preview = doc.get("body_preview") or ""
                    if preview:
                        text = preview
                        row["body_source"] = "preview"
                if not text:
                    row["body_source"] = "empty"
                    empty_body += 1
                body = _truncate(text)
                row["body"] = body if body else None
                row["body_length"] = len(body)
                row["ok"] = True
                ok += 1
                detail = f"{len(body)} znaku {_short(body, 60)!r}"
            except Exception as e:
                row["error"] = f"{type(e).__name__}: {e}"[:500]
                status = "ERR"
                detail = row["error"][:80]
                errors += 1

            queue.append(row)
            processed += 1

            if processed % 200 == 0 or processed == 1:
                subj = _short(row["subject"], 50)
                print(f" [{n:>6}|p={processed:>5}] {status} {row['body_source']:<7} "
                      f"{row['body_length']:>7}ch | {subj}", flush=True)

            if len(queue) >= BATCH_SIZE:
                _flush(pg, queue)
                queue.clear()
    finally:
        cursor.close()

    if queue:
        _flush(pg, queue)

    return {"mailbox": mailbox, "processed": processed, "ok": ok,
            "errors": errors, "skipped": skipped, "empty_body": empty_body}


UPSERT_SQL = """
INSERT INTO emails
    (mailbox, message_id, graph_id, conversation_id, folder_path,
     subject, sender_email, sender_name, to_addrs, cc_addrs, bcc_addrs,
     sent_at, received_at, modified_at, is_read, is_draft,
     has_attachments, attachment_count, attachments_summary,
     body, body_length, body_source,
     extracted_at, extractor_version, ok, error)
VALUES
    (%(mailbox)s, %(message_id)s, %(graph_id)s, %(conversation_id)s, %(folder_path)s,
     %(subject)s, %(sender_email)s, %(sender_name)s, %(to_addrs)s, %(cc_addrs)s, %(bcc_addrs)s,
     %(sent_at)s, %(received_at)s, %(modified_at)s, %(is_read)s, %(is_draft)s,
     %(has_attachments)s, %(attachment_count)s, %(attachments_summary)s,
     %(body)s, %(body_length)s, %(body_source)s,
     %(extracted_at)s, %(extractor_version)s, %(ok)s, %(error)s)
ON CONFLICT (mailbox, message_id) DO UPDATE SET
    graph_id = EXCLUDED.graph_id,
    conversation_id = EXCLUDED.conversation_id,
    folder_path = EXCLUDED.folder_path,
    subject = EXCLUDED.subject,
    sender_email = EXCLUDED.sender_email,
    sender_name = EXCLUDED.sender_name,
    to_addrs = EXCLUDED.to_addrs,
    cc_addrs = EXCLUDED.cc_addrs,
    bcc_addrs = EXCLUDED.bcc_addrs,
    sent_at = EXCLUDED.sent_at,
    received_at = EXCLUDED.received_at,
    modified_at = EXCLUDED.modified_at,
    is_read = EXCLUDED.is_read,
    is_draft = EXCLUDED.is_draft,
    has_attachments = EXCLUDED.has_attachments,
    attachment_count = EXCLUDED.attachment_count,
    attachments_summary = EXCLUDED.attachments_summary,
    body = EXCLUDED.body,
    body_length = EXCLUDED.body_length,
    body_source = EXCLUDED.body_source,
    extracted_at = EXCLUDED.extracted_at,
    extractor_version = EXCLUDED.extractor_version,
    ok = EXCLUDED.ok,
    error = EXCLUDED.error
"""


def _flush(pg: psycopg.Connection, rows: list[dict]) -> None:
    for r in rows:
        for k in ("subject", "sender_email", "sender_name", "to_addrs", "cc_addrs",
                  "bcc_addrs", "attachments_summary", "body", "error", "folder_path"):
            if r.get(k):
                r[k] = _clean_for_pg(r[k])
    with pg.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    pg.commit()


def discover_mailboxes(db) -> list[str]:
    out = []
    for name in sorted(db.list_collection_names()):
        if name in NON_MAILBOX_COLLECTIONS:
            continue
        out.append(name)
    return out


def main() -> int:
    ap = argparse.ArgumentParser(description="enrich_fulltext_emails v1.4")
    ap.add_argument("--mailbox", default="",
                    help="Jedna konkretni schranka. Bez argumentu projede vsechny.")
    ap.add_argument("--limit", type=int,
                    help="Limit emailu na schranku (test)")
    ap.add_argument("--index-reset", action="store_true",
                    help="Pred zpracovanim schranky vymaze vsechny jeji emaily z PG "
                         "(force re-extract). Bez --mailbox SMAZE CELY index.")
    args = ap.parse_args()

    t0 = time.time()
    print("=== enrich_fulltext_emails v1.4 ===")
    print(f"Start: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    print("\nPripojuji se k PostgreSQL...")
    pg = psycopg.connect(PG_DSN, connect_timeout=10)
    with pg.cursor() as cur:
        cur.execute(SCHEMA_SQL)
    pg.commit()
    print(" Schema OK.")

    print("Pripojuji se k MongoDB...")
    mongo = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    mongo.admin.command("ping")
    db = mongo[MONGO_DB]
    print(" MongoDB OK.")

    if args.mailbox:
        mailboxes = [args.mailbox]
    else:
        mailboxes = discover_mailboxes(db)
    print(f"\nSchranky ke zpracovani ({len(mailboxes)}):")
    for mb in mailboxes:
        print(f" - {mb}")

    if args.index_reset and not args.mailbox:
        print(f"\n!!! --index-reset bez --mailbox => SMAZE CELY INDEX ({len(mailboxes)} schranek) !!!")

    results = []
    for mb in mailboxes:
        try:
            results.append(process_mailbox(pg, db[mb], mb,
                                           limit=args.limit,
                                           index_reset=args.index_reset))
        except Exception as e:
            traceback.print_exc()
            print(f" FATAL pri zpracovani {mb}: {e}")
            results.append({"mailbox": mb, "processed": 0, "ok": 0,
                            "errors": 1, "skipped": 0, "empty_body": 0})

    pg.close()

    print("\n" + "="*60)
    print("=== SHRNUTI ===")
    grand = {"processed": 0, "ok": 0, "errors": 0, "skipped": 0, "empty_body": 0}
    for r in results:
        print(f" {r['mailbox']:40} processed={r['processed']:>5} ok={r['ok']:>5} "
              f"errors={r['errors']:>3} skipped={r['skipped']:>6} empty={r['empty_body']:>4}")
        for k in grand:
            grand[k] += r.get(k, 0)
    print(f" {'TOTAL':40} processed={grand['processed']:>5} ok={grand['ok']:>5} "
          f"errors={grand['errors']:>3} skipped={grand['skipped']:>6} empty={grand['empty_body']:>4}")
    print(f"\nCelkem trvalo: {time.time() - t0:.1f} s")
    print(f"Konec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    # Exit code: 0 only when every mailbox was processed without errors.
    return 1 if grand["errors"] > 0 else 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except KeyboardInterrupt:
        print("\nPreruseno uzivatelem")
    except Exception:
        traceback.print_exc()
        sys.exit(1)
@@ -0,0 +1,289 @@
# parse_emails_tower_v1.3

## Running

**First run:**
```bash
docker exec -d python-runner bash -c \
  "python /scripts/parse_emails_tower_v1.3.py > /scripts/parse_emails_tower.log 2>&1"
```

**Resuming after an interruption (skips files that are already imported):**
```bash
docker exec -d python-runner bash -c \
  "python /scripts/parse_emails_tower_v1.3.py --skip-existing > /scripts/parse_emails_tower.log 2>&1"
```

---

## Import status

**Watching progress (live log):**
```bash
docker exec -it python-runner tail -f /scripts/parse_emails_tower.log
```

**Number of emails in MongoDB:**
```bash
docker exec -it python-runner python -c \
  "from pymongo import MongoClient; c=MongoClient('mongodb://192.168.1.76:27017'); print(c['emaily']['vbuzalka@its.jnj.com'].count_documents({}))"
```

---

**Name:** parse_emails_tower_v1.3.py
**Version:** 1.3
**Date:** 2026-06-08
**Author:** vladimir.buzalka

---

## Purpose

Imports all `.msg` files into MongoDB. **Every available property** is extracted from each file, much like EXIF data from photos.

- **DB:** `emaily`
- **Collection:** `vbuzalka@its.jnj.com`
- `_id` = Internet Message-ID (or `filename:<stem>` as a fallback)
- Safe to interrupt and rerun: documents are upserted by `_id`
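
The `_id` rule and the reason a rerun is harmless can be sketched as follows. This is a minimal illustration, not the script's actual code: `document_id` and `upsert` are hypothetical helper names, and a plain dict stands in for the MongoDB collection.

```python
from pathlib import Path
from typing import Optional


def document_id(message_id: Optional[str], filename: str) -> str:
    """Return the Message-ID if present, else the 'filename:<stem>' fallback."""
    if message_id and message_id.strip():
        return message_id.strip()
    return f"filename:{Path(filename).stem}"


def upsert(store: dict, doc: dict) -> None:
    """Dict stand-in for an upsert keyed by _id: insert or overwrite."""
    store[doc["_id"]] = doc


store: dict = {}
for _ in range(2):  # simulate an interrupted run followed by a full rerun
    upsert(store, {"_id": document_id("<abc@example.com>", "7A3F0000.msg"), "subject": "hello"})
    upsert(store, {"_id": document_id(None, "7A3F0001.msg"), "subject": "no Message-ID"})

print(sorted(store))  # two keys, no duplicates after the rerun
```

Because the key is derived deterministically from the message, writing the same file twice overwrites one document instead of creating a second one.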

---

## Environment

Runs in the **python-runner** Docker container on **Unraid Tower**.

| Component | Location |
|---|---|
| Container | `python-runner` (Docker on Unraid Tower) |
| .msg files | `/mnt/user/JNJEMAILS` → `/mnt/JNJEMAILS` inside the container |
| Scripts | `/mnt/user/Scripts` → `/scripts` inside the container |
| MongoDB | `192.168.1.76:27017` (external, outside the container) |

---

## Running (from the Unraid terminal)

**Test on 50 emails:**
```bash
docker exec -it python-runner python /scripts/parse_emails_tower_v1.3.py --limit 50 --no-indexes
```

**Full import in the background (log written to a file):**
```bash
docker exec -d python-runner bash -c \
  "python /scripts/parse_emails_tower_v1.3.py > /scripts/parse_emails_tower.log 2>&1"
```

**Resuming after an interruption:**
```bash
docker exec -d python-runner bash -c \
  "python /scripts/parse_emails_tower_v1.3.py --skip-existing > /scripts/parse_emails_tower.log 2>&1"
```

**Watching progress (Ctrl+C stops the tail, the import keeps running):**
```bash
docker exec -it python-runner tail -f /scripts/parse_emails_tower.log
```

### All parameters

| Parameter | Description |
|---|---|
| `--skip-existing` | Loads the list of already imported files from MongoDB and skips them. Use it to resume after an interruption. |
| `--limit N` | Processes only the first N files. Useful for testing. |
| `--no-indexes` | Does not create indexes at the end. Use it if you interrupt the run midway; create the indexes manually once everything is imported. |
| `--msgs-dir PATH` | Overrides the default path to the .msg files (default: `/mnt/JNJEMAILS`). |
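
For orientation, the four flags could be declared with `argparse` roughly like this. It is a sketch under assumptions: only the flag names and the default path come from the table above; `build_parser`, the help texts and the types are illustrative and may differ from the real script.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the parameter table."""
    ap = argparse.ArgumentParser(description="parse_emails_tower (CLI sketch)")
    ap.add_argument("--skip-existing", action="store_true",
                    help="skip files that are already stored in MongoDB")
    ap.add_argument("--limit", type=int, default=None, metavar="N",
                    help="process only the first N files")
    ap.add_argument("--no-indexes", action="store_true",
                    help="do not create indexes at the end of the run")
    ap.add_argument("--msgs-dir", type=Path, default=Path("/mnt/JNJEMAILS"),
                    metavar="PATH", help="directory containing the .msg files")
    return ap


# Equivalent of the "test on 50 emails" command line.
args = build_parser().parse_args(["--limit", "50", "--no-indexes"])
print(args.limit, args.no_indexes, args.skip_existing, args.msgs_dir)
```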

---

## Console output

One line per email:
```
1/69371 OK RE: Protocol deviation CZ10022 jan.novak@its.jnj.com
2/69371 OK UCO3001: Draft FUL pro DD5-CZ10022 monitor@4gclinical.com
3/69371 ERR ? ?
```

Every 500 emails, a separator with the progress so far:
```
────────────────────────────────────────────────────────────────────────────────
Průběh: ok=498 err=2 0.4 msg/s ETA 47h12m
────────────────────────────────────────────────────────────────────────────────
```
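
The rate and ETA in such a progress line follow from simple arithmetic: rate = processed / elapsed, ETA = remaining / rate. The sketch below shows one way to compute them; `progress_line` is a hypothetical helper, and the figures in the sample output above are illustrative, so it is not claimed to reproduce them exactly.

```python
def progress_line(ok: int, err: int, total: int, elapsed_s: float) -> str:
    """Format counters, throughput and a remaining-time estimate."""
    done = ok + err
    rate = done / elapsed_s if elapsed_s > 0 else 0.0  # messages per second
    if rate > 0:
        eta_s = int((total - done) / rate)              # seconds still needed
        eta = f"{eta_s // 3600}h{eta_s % 3600 // 60:02d}m"
    else:
        eta = "?"                                       # nothing processed yet
    return f"ok={ok} err={err} {rate:.1f} msg/s ETA {eta}"


print(progress_line(ok=498, err=2, total=69371, elapsed_s=1250.0))
```

At ~0.4 msg/s the full set of ~69,000 files takes about 69,000 / 0.4 ≈ 172,500 s, which is where the roughly 48-hour estimate comes from.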

At the end, a summary:
```
====================================================
Vysledek: ok=69300 | skip=0 | err=71
Celkovy cas: 47h 23m 10s
Dokumentu v kolekci: 69300
```

---

## Data extracted from each .msg

| Field | Description |
|---|---|
| Subject, normalized subject | |
| Sender | email, name, SMTP address |
| To/CC/BCC recipients | structured as `[{type, email, name}]` |
| Delivery and sent time | UTC |
| Body | plain text + HTML (max 2 MB) |
| Attachments | metadata: name, size, MIME type, inline flag |
| Internet headers | X-Originating-IP, Received, DKIM, X-Mailer, ... |
| MAPI | importance, sensitivity, flag, conversation thread, categories |
| In-Reply-To, References | for thread reconstruction |
| Raw MAPI properties | `{0xXXXX: value}` |

---

## Value codes

| Field | Value | Meaning |
|---|---|---|
| `importance` | 0 | Low |
| | 1 | Normal |
| | 2 | High |
| `sensitivity` | 0 | Normal |
| | 1 | Personal |
| | 2 | Private |
| | 3 | Confidential |
| `flag_status` | 0 | No flag |
| | 1 | Flagged (follow up) |
| | 2 | Completed |
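
When reading documents back, the numeric codes can be turned into labels with a small lookup. A minimal sketch: the code-to-meaning pairs are taken from the table above, while `VALUE_LABELS` and `describe` are hypothetical names that do not exist in the script.

```python
VALUE_LABELS = {
    "importance": {0: "low", 1: "normal", 2: "high"},
    "sensitivity": {0: "normal", 1: "personal", 2: "private", 3: "confidential"},
    "flag_status": {0: "no flag", 1: "flagged (follow up)", 2: "completed"},
}


def describe(doc: dict) -> dict:
    """Map the coded fields present in a document to readable labels."""
    out = {}
    for field, labels in VALUE_LABELS.items():
        if field in doc:
            code = doc[field]
            # Codes outside the table are reported rather than hidden.
            out[field] = labels.get(code, f"unknown ({code})")
    return out


print(describe({"importance": 2, "sensitivity": 0, "flag_status": 1}))
```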

---

## MongoDB indexes

Created automatically at the end of the import (`--no-indexes` skips this step):

| Index | Fields |
|---|---|
| Chronological | `received_at`, `sent_at` |
| Sender | `sender.email` |
| File | `filename` (unique) |
| Conversation | `conversation_topic` |
| Filters | `has_attachments`, `categories`, `importance`, `flag_status` |
| Full-text | `subject` + `body_text` + `to` + `cc` (text index `text_search`) |
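
If the import ran with `--no-indexes`, the indexes can be created by hand later. The sketch below is an assumption-laden illustration: the table does not say whether the chronological and filter indexes are single-field or compound, so single-field indexes are assumed, and `INDEX_SPECS` / `create_indexes` are hypothetical names. The key specs use the plain values `1` and `"text"`, which are what `pymongo.ASCENDING` and `pymongo.TEXT` stand for.

```python
from typing import Any

# (keys, options) pairs passed to Collection.create_index.
INDEX_SPECS: list[tuple[list[tuple[str, Any]], dict]] = [
    ([("received_at", 1)], {}),
    ([("sent_at", 1)], {}),
    ([("sender.email", 1)], {}),
    ([("filename", 1)], {"unique": True}),
    ([("conversation_topic", 1)], {}),
    ([("has_attachments", 1)], {}),
    ([("categories", 1)], {}),
    ([("importance", 1)], {}),
    ([("flag_status", 1)], {}),
    # MongoDB allows a single text index per collection, so all four
    # searchable fields go into one index named text_search.
    ([("subject", "text"), ("body_text", "text"), ("to", "text"), ("cc", "text")],
     {"name": "text_search"}),
]


def create_indexes(coll) -> int:
    """Request every index on the given collection; return how many were requested."""
    for keys, options in INDEX_SPECS:
        coll.create_index(keys, **options)
    return len(INDEX_SPECS)
```

With pymongo installed, `coll` would be `MongoClient("mongodb://192.168.1.76:27017")["emaily"]["vbuzalka@its.jnj.com"]`; creating an index that already exists with the same definition is a no-op, so the function can be rerun safely.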

---

## Sample queries (MongoDB shell / MCP)

**Emails about UCO3001 with an attachment:**
```javascript
db["vbuzalka@its.jnj.com"].find({
  $text: { $search: "UCO3001" },
  has_attachments: true
}).sort({ received_at: -1 })
```

**Emails from a specific sender:**
```javascript
db["vbuzalka@its.jnj.com"].find({
  "sender.email": /covance/i
}).sort({ received_at: -1 })
```

**A whole conversation thread:**
```javascript
db["vbuzalka@its.jnj.com"].find({
  conversation_topic: "Protocol deviation CZ10022"
}).sort({ received_at: 1 })
```

**Statistics by sender (top 20):**
```javascript
db["vbuzalka@its.jnj.com"].aggregate([
  { $group: { _id: "$sender.email", count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 20 }
])
```

---

## Error log

Files that failed are logged to a **separate** `parse_emails_tower_errors.log` next to the script (i.e. `/scripts/parse_emails_tower_errors.log` → `\\tower\Scripts\parse_emails_tower_errors.log`). This log is kept apart from the Graph import so that the two do not get mixed up:
```
2026-06-08 12:40:33 | open failed [7A3F...0000.msg]: <reason>
2026-06-08 12:41:02 | per-dokument selhal [_id=<...>]: <reason>
```

Stdout (progress) goes to `parse_emails_tower.log`, which is also a separate file.

---

## Rescuing problematic .msg files (v1.3)

Some `.msg` files cannot be opened by `extract_msg` with default settings, and the whole file is dropped **even though the email itself is perfectly fine** (it opens in Outlook). There are three causes, each with its own fix:

| Cause | Example | Fix |
|---|---|---|
| Broken attachment without `PR_ATTACH_METHOD` | "Attachment method missing" | `errorBehavior=SUPPRESS_ALL`: skips the broken attachment and loads the rest (body, other attachments) |
| Body declares codepage 1200 (UTF-16), but the bytes are cp1250/gb2312 | `�` in place of Czech diacritics | raw-OLE read + cascading decode |
| Embedded email (Outlook item) | "not an MSG file", `extract_msg` returns nothing | raw-OLE read of the key MAPI streams |

**How it works:**

1. `open_message()`: cascading open, `normal` → `SUPPRESS_ALL` → `+overrideEncoding` (based on the codepage property).
2. **raw-OLE fallback**: when extract_msg returns empty text or `�`, or had to guess the encoding, the key fields (subject, sender, body, html) are read **directly from the OLE streams** (`__substg1.0_0037`/`0C1A`/`5D01`/`1000`/`1013`) with a cascading decode:
   ```
   utf-8 (strict) → encoding from CPID → cp1250 → cp1252 → gb2312 → latin-1
   ```
   Encoding headers are **not trusted** (they often contradict each other); the first encoding that decodes strictly without an error wins. `utf-8 strict` is a strong discriminator.
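
The cascading decode can be sketched in a few lines of standard-library Python. This is an illustration of the technique rather than the script's own function: `decode_cascade` and its `cpid_encoding` parameter are hypothetical names, and the caller is assumed to have already mapped the CPID to a Python codec name.

```python
from typing import Optional


def decode_cascade(raw: bytes, cpid_encoding: Optional[str] = None) -> tuple[str, str]:
    """Return (text, encoding) for the first codec that decodes raw strictly."""
    candidates = ["utf-8", cpid_encoding, "cp1250", "cp1252", "gb2312", "latin-1"]
    for enc in candidates:
        if not enc:
            continue  # no CPID-derived codec was supplied
        try:
            return raw.decode(enc, errors="strict"), enc
        except (UnicodeDecodeError, LookupError):
            continue  # wrong codec or unknown codec name: try the next one
    # Not reached in practice: latin-1 maps every byte value.
    return raw.decode("latin-1"), "latin-1"


sample = "Příliš žluťoučký kůň"
print(decode_cascade(sample.encode("cp1250")))  # utf-8 rejects these bytes first
print(decode_cascade(sample.encode("utf-8")))
```

Strict UTF-8 rejects most single-byte Central European text, which is why it works as the first filter. One caveat of this sketch: a multi-byte codec such as UTF-16 accepts almost any even-length input, so passing it as `cpid_encoding` would not by itself detect the "declared 1200, actually cp1250" case.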

**New document fields:**

| Field | Meaning |
|---|---|
| `parse_mode` | `normal` / `suppress_all` / `override:<enc>`: how the file was opened |
| `parse_degraded` | `true` = a fallback was needed (broken attachment or guessed encoding) |
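
How the two fields could fall out of a cascading open is sketched below. To stay runnable without `extract_msg`, the opener is injected: `open_with_fallbacks`, `opener`, `suppress_errors` and `encoding` are all hypothetical names and are not the library's keyword arguments; the real `open_message()` may be structured differently.

```python
from typing import Any, Callable, Optional


def open_with_fallbacks(
    path: str,
    opener: Callable[..., Any],
    override_encoding: Optional[str] = None,
) -> tuple[Any, str, bool]:
    """Try progressively more tolerant settings.

    Returns (message, parse_mode, parse_degraded).
    """
    attempts: list[tuple[str, dict]] = [
        ("normal", {}),
        ("suppress_all", {"suppress_errors": True}),
    ]
    if override_encoding:
        attempts.append((
            f"override:{override_encoding}",
            {"suppress_errors": True, "encoding": override_encoding},
        ))
    last_error: Optional[Exception] = None
    for mode, kwargs in attempts:
        try:
            # Anything other than a clean "normal" open counts as degraded.
            return opener(path, **kwargs), mode, mode != "normal"
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"all open attempts failed for {path}") from last_error


def fake_opener(path, suppress_errors=False, encoding=None):
    """Stand-in that fails like a file with a broken attachment."""
    if not suppress_errors:
        raise ValueError("Attachment method missing")
    return {"path": path, "encoding": encoding}


msg, mode, degraded = open_with_fallbacks("7A3F0000.msg", fake_opener, "cp1250")
print(mode, degraded)
```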

**Verified:** all 126 files that failed in the 8 June run are now recovered cleanly (74× `suppress_all`, 52× `override:cp1250`), 0 empty, 0 containing `�`.

Finding degraded documents:
```javascript
db["vbuzalka@its.jnj.com"].find({ parse_degraded: true })
```

---

## Performance

| Parameter | Value |
|---|---|
| Number of files | ~69,000 |
| Rate | ~0.4 msg/s (dominated by htmlBody decoding) |
| Estimated time | 48 hours |
| Batch size | 200 documents / bulk_write |
| Estimated DB size | 2–5 GB |

---

## Dependencies (in the python-runner Docker image)

```
extract-msg==0.55.0
olefile
pymongo
python-dateutil
```

The image is built from the `Dockerfile` in `/mnt/user/Scripts/python-runner/`.

---

## Version history

| Version | Date | Change |
|---|---|---|
| 1.0 | 2026-06-01 | Initial version |
| 1.1 | 2026-06-02 | Deployed on Unraid Tower in the python-runner Docker container; MSGS_DIR changed from the SMB share (`\\tower\JNJEMAILS`) to the local mount (`/mnt/JNJEMAILS`); run instructions updated for `docker exec` |
| 1.2 | 2026-06-08 | **`to_bson` fix:** ints outside the int64 range (BSON only supports 8-byte ints) are converted to strings. Previously the whole `bulk_write` failed with `MongoDB can only handle up to 8-byte ints` and dropped the entire batch of 200 documents (the v1.1 run on 8 June stored **nothing**). `flush()` now falls back to per-document writes (only the bad record is dropped, not the whole batch). `bool()` is tested before `int()`. Separate logs `parse_emails_tower.log` + `parse_emails_tower_errors.log`. |
| 1.3 | 2026-06-08 | **Rescue of previously failing .msg files** (about 126 from the 8 June run): the cascading open in `open_message()` (`normal`→`SUPPRESS_ALL`→`+overrideEncoding`) handles both broken attachments and "not an MSG file"; the **raw-OLE fallback** reads subject/sender/body/html directly from the OLE streams with a cascading decode (utf-8 strict→CPID→cp1250…) when extract_msg returns empty text or `�`. New fields `parse_mode`, `parse_degraded`. New dependency `olefile`. Verified: 126/126 recovered cleanly. |
@@ -0,0 +1,896 @@
"""
parse_emails_tower_v1.3.py
Name:    parse_emails_tower_v1.3.py
Version: 1.3
Date:    2026-06-08
Author:  vladimir.buzalka

Description:
    Parses all .msg files in MSGS_DIR and imports them as documents
    into MongoDB. ALL available properties are extracted from each file,
    much like EXIF data from photos:

    - subject, sender, recipients (To/CC/BCC with types)
    - delivery and sent time (UTC)
    - plain-text and HTML body (max 2 MB)
    - attachments (metadata: name, size, MIME type, inline flag)
    - internet headers (X-Originating-IP, Received, DKIM, ...)
    - MAPI properties: importance, sensitivity, flag, conversation thread,
      categories, In-Reply-To, References, ...
    - all raw MAPI properties as {0xXXXX: value}

    DB:         emaily
    Collection: vbuzalka@its.jnj.com
    _id:        Internet Message-ID (or "filename:<stem>" as a fallback)

    Safe to interrupt and rerun:
    - upsert by _id: duplicates are overwritten automatically
    - --skip-existing loads the list of already imported files from MongoDB
      and skips them, so a run resumes after an interruption without losing work

Environment:
    Runs in the "python-runner" Docker container on Unraid Tower.
    The .msg files are available as a local disk (volume mount):
        /mnt/user/JNJEMAILS -> /mnt/JNJEMAILS (inside the container)
    MongoDB at 192.168.1.76:27017 (external, runs outside the container).

Running (from the Unraid terminal):
    # Test on 50 emails:
    docker exec -it python-runner python /scripts/parse_emails_tower_v1.3.py --limit 50 --no-indexes

    # Full import in the background (separate log, not shared with the Graph import):
    docker exec -d python-runner bash -c \
        "python /scripts/parse_emails_tower_v1.3.py > /scripts/parse_emails_tower.log 2>&1"

    # Resuming after an interruption:
    docker exec -d python-runner bash -c \
        "python /scripts/parse_emails_tower_v1.3.py --skip-existing > /scripts/parse_emails_tower.log 2>&1"

    # Watching progress:
    docker exec -it python-runner tail -f /scripts/parse_emails_tower.log

Console output:
    One line per email:
        <position>/<total> OK/ERR <subject, 60 chars> <sender>
    Every 500 emails: a separator with progress, rate and ETA.
    At the end: summary of ok/skip/err, total time, number of documents in the collection.

Dependencies (installed in the python-runner Docker image):
    extract-msg==0.55.0, olefile, pymongo, python-dateutil
    Python 3.12, Linux (Docker container on Unraid Tower)
    (olefile is a transitive dependency of extract-msg; the raw-OLE fallback uses it directly)

MongoDB document structure:
    _id                         Internet Message-ID (or filename: fallback)
    filename                    name of the .msg file (20-character hex + .msg)
    subject                     message subject
    normalized_subject          subject without RE:/FW: prefixes
    importance                  0=low 1=normal 2=high
    sensitivity                 0=normal 1=personal 2=private 3=confidential
    flag_status                 0=no flag 1=flagged 2=completed
    read_receipt_requested      bool
    delivery_receipt_requested  bool
    has_attachments             bool
    attachment_count            int
    message_size_bytes          size of the .msg file on disk
    conversation_topic          thread topic (PR_CONVERSATION_TOPIC)
    conversation_index          base64 PR_CONVERSATION_INDEX
    in_reply_to                 Message-ID of the previous message
    internet_references         [Message-ID]: the full thread history
    categories                  [str]: MAPI categories / labels
    received_at                 datetime UTC: delivery time
    sent_at                     datetime UTC: sent time
    sender.email                sender's email address
    sender.name                 sender's display name
    sender.smtp                 SMTP address (for internal EX addresses)
    to                          To string (as shown in Outlook)
    cc                          CC string
    bcc                         BCC string
    display_to                  PR_DISPLAY_TO (abbreviated list)
    display_cc                  PR_DISPLAY_CC
    recipients                  [{type, email, name}]: to/cc/bcc with types
    body_text                   plain-text body
    body_html                   HTML body (max 2 MB, None if absent)
    attachments                 [{filename, size_bytes, mime_type,
                                  content_id, is_inline}]
    headers                     dict of internet headers (lowercase_with_underscores)
    mapi                        dict of all raw MAPI properties {0xXXXX: value}
    parsed_at                   datetime UTC: parse time

Indexes (created automatically at the end):
    received_at, sent_at, sender.email, filename (unique),
    conversation_topic, has_attachments, categories, importance,
    flag_status, text_search (subject + body_text + to + cc)

Errors:
    Files that failed are logged to parse_emails_tower_errors.log
    in the script directory (a SEPARATE log, kept apart from the Graph import).
    Line format: timestamp | open/extract failed | reason.

Version history:
    1.0 2026-06-01 Initial version
    1.1 2026-06-02 Deployed on Unraid Tower in the python-runner Docker container;
                   MSGS_DIR changed from an SMB share to the local mount /mnt/JNJEMAILS;
                   run instructions updated for docker exec
    1.2 2026-06-08 FIX: to_bson converts ints outside the int64 range to strings
                   (BSON only supports 8-byte ints). Previously the whole bulk_write
                   failed with 'MongoDB can only handle up to 8-byte ints' and dropped
                   the entire batch of 200 documents (the v1.1 run on 8 June stored
                   NOTHING). flush() falls back to per-document writes: only the bad
                   record is dropped, not the whole batch. bool() is tested before int().
                   Separate error log parse_emails_tower_errors.log and
                   stdout log parse_emails_tower.log (previously shared with the Graph
                   import, which cluttered the log).
    1.3 2026-06-08 RESCUE of previously failing .msg files (about 126 from the 8 June run):
                   - open_message(): cascading open
                     normal -> SUPPRESS_ALL (broken attachments) -> +overrideEncoding
                     Handles both 'Attachment method missing' and 'not an MSG file'.
                   - raw-OLE fallback: when extract_msg returns empty text or � (embedded
                     email, or codepage 1200 that may actually be cp1250/gb2312), the key
                     fields (subject/sender/body/html) are read DIRECTLY from the OLE
                     streams with a cascading decode (utf-8 strict -> CPID -> cp1250 ...).
                     Encoding headers are not trusted (they often contradict each other).
                   - new fields: parse_mode (normal/suppress_all/override:ENC),
                     parse_degraded (bool).
"""

import sys
import re
import logging
import argparse
import base64
import struct
from pathlib import Path
from datetime import datetime, timezone
from typing import Optional

import extract_msg
from extract_msg.enums import ErrorBehavior
import olefile
from dateutil import parser as dtparser
from pymongo import MongoClient, UpdateOne, ASCENDING, TEXT

if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8", errors="replace")

# ─── CONFIGURATION ────────────────────────────────────────────────────────────
MSGS_DIR = Path("/mnt/JNJEMAILS")
MONGO_URI = "mongodb://192.168.1.76:27017"
MONGO_DB = "emaily"
MONGO_COL = "vbuzalka@its.jnj.com"
BATCH_SIZE = 200
LOG_FILE = Path(__file__).parent / "parse_emails_tower_errors.log"
SCRIPT_VERSION = "1.3"
# ──────────────────────────────────────────────────────────────────────────────

logging.basicConfig(
    filename=str(LOG_FILE),
    level=logging.ERROR,
    format="%(asctime)s | %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    encoding="utf-8",
)


# ─── Helper functions ─────────────────────────────────────────────────────────

def safe(obj, *attrs, default=None):
    """Safely read attributes: return the first value that is neither None nor a blank string."""
    for attr in attrs:
        try:
            val = getattr(obj, attr, None)
            if val is None:
                continue
            if isinstance(val, str) and not val.strip():
                continue
            return val
        except Exception:
            continue
    return default


def parse_date(raw) -> Optional[datetime]:
    """Convert any date value to a naive UTC datetime (for MongoDB).

    Naive inputs are returned unchanged, i.e. they are assumed to be UTC already.
    """
    if raw is None:
        return None
    if isinstance(raw, datetime):
        if raw.tzinfo:
            return raw.astimezone(timezone.utc).replace(tzinfo=None)
        return raw
    try:
        dt = dtparser.parse(str(raw))
        if dt.tzinfo:
            return dt.astimezone(timezone.utc).replace(tzinfo=None)
        return dt
    except Exception:
        return None


_INT64_MIN, _INT64_MAX = -(2 ** 63), 2 ** 63 - 1


def to_bson(val):
    """Convert a value to a BSON-serializable type.

    Note: BSON only supports signed int64. Python integers are unbounded, so
    large MAPI values (PR_CHANGE_KEY, FILETIME, 64-bit handles) outside the
    int64 range are converted to strings; otherwise the whole bulk_write fails
    with 'MongoDB can only handle up to 8-byte ints'.
    """
    # bool must be checked BEFORE int (isinstance(True, int) is True)
    if isinstance(val, bool):
        return val
    if isinstance(val, bytes):
        return val.hex() if len(val) <= 128 else f"<bytes:{len(val)}>"
    if isinstance(val, datetime):
        return parse_date(val)
    if isinstance(val, int):
        return val if _INT64_MIN <= val <= _INT64_MAX else str(val)
    if isinstance(val, (str, float, type(None))):
        return val
    if isinstance(val, list):
        return [to_bson(v) for v in val]
    # Last resort: int-like objects (e.g. enums) become ints, anything else a string.
    try:
        iv = int(val)
        return iv if _INT64_MIN <= iv <= _INT64_MAX else str(iv)
    except Exception:
        pass
    return str(val)


# ─── Extracting message parts ─────────────────────────────────────────────────

def extract_headers(msg) -> dict:
    headers = {}
    try:
        hdr = msg.header
        if not hdr:
            return {}
        from email.header import decode_header as _dh

        def _decode(v: str) -> str:
            try:
                parts = _dh(v)
                out = ""
                for part, enc in parts:
                    out += part.decode(enc or "utf-8", errors="replace") if isinstance(part, bytes) else part
                return out
            except Exception:
                return v

        for key in set(hdr.keys()):
            k = key.lower().replace("-", "_")
            vals = [_decode(v) for v in hdr.get_all(key, [])]
            headers[k] = vals if len(vals) > 1 else (vals[0] if vals else "")
    except Exception as e:
        logging.error("extract_headers: %s", e)
    return headers


def extract_recipients(msg) -> list:
    result = []
    type_map = {1: "to", 2: "cc", 3: "bcc"}
    try:
        for r in msg.recipients:
            rtype = getattr(r, "type", 1)
            try:
                rtype = int(rtype)
            except Exception:
                try:
                    rtype = int(rtype.value)
                except Exception:
                    rtype = 1
            rec = {
                "type": type_map.get(rtype, "to"),
                "email": safe(r, "email", default=""),
                "name": safe(r, "name", default=""),
            }
            result.append(rec)
    except Exception as e:
        logging.error("extract_recipients: %s", e)
    return result
|
||||
|
||||
|
||||
def extract_attachments(msg) -> list:
    result = []
    try:
        for att in msg.attachments:
            fname = safe(att, "longFilename", "shortFilename", default="")
            if not fname:
                continue
            size = 0
            try:
                d = att.data
                size = len(d) if d else 0
            except Exception:
                pass
            result.append({
                "filename": fname,
                "size_bytes": size,
                "mime_type": safe(att, "mimetype", "mimeType", default="application/octet-stream"),
                "content_id": safe(att, "cid", default=None),
                "is_inline": bool(safe(att, "isInline", default=False)),
            })
    except Exception as e:
        logging.error("extract_attachments: %s", e)
    return result
|
||||
|
||||
|
||||
def extract_mapi_props(msg) -> dict:
    """All raw MAPI properties as {0xXXXX: value}."""
    result = {}
    try:
        props = msg.props
        if not hasattr(props, "items"):
            return {}
        for key, prop in props.items():
            try:
                val = to_bson(prop.value)
                prop_id = f"0x{key[:4].upper()}" if len(key) >= 4 else f"0x{key.upper()}"
                result[prop_id] = val
            except Exception:
                pass
    except Exception as e:
        logging.error("extract_mapi_props: %s", e)
    return result
|
||||
|
||||
|
||||
# ─── Tolerant opening and raw-OLE fallback ───────────────────────────────────
#
# extract_msg cannot handle some .msg files: (a) a broken attachment without
# PR_ATTACH_METHOD, (b) a body that declares codepage 1200 (UTF-16) while the
# bytes are cp1250/gb2312, (c) an embedded email ("not an MSG file"), for which
# extract_msg returns empty fields. The data is still in the file, though. We
# open tolerantly and re-read the degraded text fields DIRECTLY from the OLE
# streams with cascading decoding (headers are not trusted).

# Windows codepage -> Python codec (PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE)
_CPID_TO_CODEC = {
    1250: "cp1250", 1251: "cp1251", 1252: "cp1252", 1253: "cp1253",
    1254: "cp1254", 1255: "cp1255", 1256: "cp1256", 1257: "cp1257",
    1258: "cp1258", 874: "cp874", 932: "shift_jis", 936: "gb2312",
    949: "euc_kr", 950: "big5", 65001: "utf-8", 28591: "iso-8859-1",
    28592: "iso-8859-2", 20127: "ascii",
}
|
||||
|
||||
|
||||
def _read_u32_prop(ole, propid):
    """Read a 32-bit MAPI property value from the top-level __properties_version1.0 stream."""
    try:
        data = ole.openstream("__properties_version1.0").read()
    except Exception:
        return None
    body = data[32:]  # 32-byte header of the top-level property stream
    for i in range(0, len(body) - 16 + 1, 16):
        rec = body[i:i + 16]
        tag = struct.unpack("<I", rec[0:4])[0]
        if ((tag >> 16) & 0xFFFF) == propid:
            return struct.unpack("<I", rec[8:12])[0]
    return None
|
||||
|
||||
|
||||
def _detect_cpid(ole) -> Optional[str]:
    """Codec per PR_INTERNET_CPID / PR_MESSAGE_CODEPAGE (a hint, not a guarantee)."""
    for pid in (0x3FDE, 0x3FFD):  # INTERNET_CPID, MESSAGE_CODEPAGE
        codec = _CPID_TO_CODEC.get(_read_u32_prop(ole, pid))
        # utf-8/ascii are poor hints for an 8-bit stream (they are often wrong)
        if codec and codec not in ("utf-8", "ascii"):
            return codec
    return None
|
||||
|
||||
|
||||
def _cascade_decode(raw: bytes, is_unicode: bool, cpid_codec: Optional[str]) -> str:
    """Decode the bytes of a MAPI string. Headers are not trusted: try strict
    decoding in priority order and take the first codec that succeeds."""
    if not raw:
        return ""
    if is_unicode:  # PT_UNICODE = utf-16-le
        try:
            return raw.decode("utf-16-le")
        except Exception:
            return raw.decode("utf-16-le", errors="replace")
    order = ["utf-8"]  # strict utf-8 is a strong discriminator
    if cpid_codec:
        order.append(cpid_codec)
    order += ["cp1250", "cp1252", "gb2312", "big5"]
    for enc in order:
        try:
            return raw.decode(enc, errors="strict")
        except Exception:
            continue
    return raw.decode("latin-1", errors="replace")  # never fails
|
||||
|
||||
|
||||
def _raw_mapi_strings(msg_path: Path) -> dict:
    """Read the key text MAPI fields DIRECTLY from OLE (bypassing extract_msg).
    Used only when extract_msg returns a degraded field."""
    out = {"subject": "", "normalized_subject": "", "sender_name": "",
           "sender_email": "", "sender_smtp": "", "body_text": "", "body_html": ""}
    try:
        ole = olefile.OleFileIO(str(msg_path))
    except Exception:
        return out
    try:
        cpid = _detect_cpid(ole)
        wanted = {  # MAPI tag -> key in out
            "0037": "subject", "0E1D": "normalized_subject",
            "0C1A": "sender_name", "5D01": "sender_smtp",
            "0C1F": "sender_email", "1000": "body_text", "1013": "body_html",
        }
        prefix = "__substg1.0_"
        found = {}  # key -> (type_priority, value)
        for entry in ole.listdir():
            if len(entry) != 1:  # top level only (no embedded messages)
                continue
            name = entry[0]
            if not name.startswith(prefix):
                continue
            tag = name[len(prefix):len(prefix) + 4].upper()
            key = wanted.get(tag)
            if not key:
                continue
            typ = name[-4:].upper()
            prio = {"001F": 3, "001E": 2, "0102": 1}.get(typ, 0)
            if prio == 0:
                continue
            prev = found.get(key)
            if prev and prev[0] >= prio:  # prefer unicode > ansi > binary
                continue
            try:
                raw = ole.openstream(entry).read()
                val = _cascade_decode(raw, typ == "001F", cpid)
            except Exception:
                continue
            found[key] = (prio, val)
        for key, (_, val) in found.items():
            out[key] = val
    finally:
        ole.close()
    return out
|
||||
|
||||
|
||||
def _degraded(s) -> bool:
    """A field is degraded if it is empty or contains U+FFFD (the replacement character)."""
    return (not s) or ("\ufffd" in s)
|
||||
|
||||
|
||||
def open_message(msg_path: Path):
    """Cascading open of a .msg -> (msg, mode) or (None, None).

      normal        the regular path
      suppress_all  tolerant of broken attachments
      override:ENC  tolerant + encoding forced from the codepage property
    """
    try:
        return extract_msg.Message(str(msg_path)), "normal"
    except Exception:
        pass
    try:
        return extract_msg.Message(
            str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL), "suppress_all"
    except Exception:
        pass
    encs = []
    try:
        ole = olefile.OleFileIO(str(msg_path))
        c = _detect_cpid(ole)
        ole.close()
        if c:
            encs.append(c)
    except Exception:
        pass
    for e in encs + ["cp1250", "cp1252"]:
        try:
            return extract_msg.Message(
                str(msg_path), errorBehavior=ErrorBehavior.SUPPRESS_ALL,
                overrideEncoding=e), f"override:{e}"
        except Exception:
            continue
    return None, None
|
||||
|
||||
|
||||
# ─── Main extraction ─────────────────────────────────────────────────────────

def extract_message(msg_path: Path) -> Optional[dict]:
    """Parse a single .msg file -> MongoDB document."""
    msg, parse_mode = open_message(msg_path)
    if msg is None:
        logging.error("open failed [%s]: vsechny pokusy o otevreni selhaly", msg_path.name)
        return None

    try:
        # ── Message-ID ────────────────────────────────────────────────
        mid = None
        for attr in ("messageId", "message_id", "internetMessageId"):
            mid = safe(msg, attr)
            if mid:
                break
        if not mid:
            mid = f"filename:{msg_path.stem}"
        mid = str(mid).strip()
|
||||
|
||||
        # ── Subject ───────────────────────────────────────────────────
        try:
            subject = msg.subject or ""
        except Exception:
            subject = ""

        normalized_subject = safe(msg, "normalizedSubject", "normalized_subject", default="")

        # ── Body ──────────────────────────────────────────────────────
        try:
            body_text = msg.body or ""
        except Exception:
            body_text = ""

        body_html = None
        try:
            bh = msg.htmlBody
            if isinstance(bh, bytes):
                bh = bh.decode("utf-8", errors="replace")
            if bh:
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
        except Exception:
            pass

        # ── Sender ────────────────────────────────────────────────────
        try:
            sender_email = msg.sender or ""
        except Exception:
            sender_email = ""

        sender_name = safe(msg, "senderName", "sender_name", default="")
        sender_smtp = safe(msg, "senderSmtpAddress", "sent_representing_smtp_address", default="")
|
||||
|
||||
        # ── Recipients ────────────────────────────────────────────────
        recipients = extract_recipients(msg)

        try:
            to_raw = msg.to or ""
        except Exception:
            to_raw = ""
        try:
            cc_raw = msg.cc or ""
        except Exception:
            cc_raw = ""
        try:
            bcc_raw = getattr(msg, "bcc", None) or ""
        except Exception:
            bcc_raw = ""

        display_to = safe(msg, "displayTo", "display_to", default="")
        display_cc = safe(msg, "displayCc", "display_cc", default="")

        # ── Timestamps ────────────────────────────────────────────────
        try:
            received_at = parse_date(msg.date)
        except Exception:
            received_at = None

        sent_at = None
        for attr in ("clientSubmitTime", "client_submit_time", "sentOn"):
            v = safe(msg, attr)
            if v:
                sent_at = parse_date(v)
                break
|
||||
|
||||
        # ── MAPI properties ───────────────────────────────────────────
        importance = 1
        try:
            v = msg.importance
            if v is not None:
                importance = int(v)
        except Exception:
            pass

        sensitivity = 0
        try:
            v = getattr(msg, "sensitivity", None)
            if v is not None:
                sensitivity = int(v)
        except Exception:
            pass

        flag_status = 0
        try:
            v = safe(msg, "flagStatus", "flag_status")
            if v is not None:
                flag_status = int(v)
        except Exception:
            pass

        conversation_topic = safe(msg, "conversationTopic", "conversation_topic", default="")

        conversation_index = ""
        try:
            ci = safe(msg, "conversationIndex", "conversation_index")
            if isinstance(ci, bytes):
                conversation_index = base64.b64encode(ci).decode()
            elif ci:
                conversation_index = str(ci)
        except Exception:
            pass

        in_reply_to = safe(msg, "inReplyTo", "in_reply_to", default="")

        internet_refs = []
        try:
            refs = safe(msg, "internetReferences", "internet_references")
            if isinstance(refs, list):
                internet_refs = refs
            elif isinstance(refs, str) and refs:
                internet_refs = [r.strip() for r in refs.split() if r.strip()]
        except Exception:
            pass

        categories = []
        try:
            cats = safe(msg, "categories")
            if isinstance(cats, list):
                categories = [str(c) for c in cats if c]
            elif isinstance(cats, str) and cats:
                categories = [c.strip() for c in re.split(r"[;,]", cats) if c.strip()]
        except Exception:
            pass

        read_receipt = bool(safe(msg, "readReceiptRequested", "read_receipt_requested", default=False))
        delivery_receipt = bool(safe(msg, "deliveryReceiptRequested", "delivery_receipt_requested", default=False))
|
||||
|
||||
        # ── Internet headers ──────────────────────────────────────────
        headers = extract_headers(msg)

        if not in_reply_to:
            in_reply_to = headers.get("in_reply_to", "")
        if not internet_refs:
            refs_str = headers.get("references", "")
            if isinstance(refs_str, str) and refs_str:
                internet_refs = [r.strip() for r in refs_str.split() if r.strip()]

        # ── Attachments ───────────────────────────────────────────────
        attachments = extract_attachments(msg)

        # ── Raw MAPI ──────────────────────────────────────────────────
        mapi_raw = extract_mapi_props(msg)

        msg.close()
|
||||
|
||||
        # ── Raw-OLE fallback for degraded text fields ─────────────────
        # When extract_msg returned an empty value or U+FFFD, or had to guess
        # the encoding (override/suppress), re-read the key fields directly from
        # the OLE streams with cascading decoding, which is more reliable than
        # a single forced encoding.
        parse_degraded = parse_mode != "normal"
        # in a non-normal mode the encoding was guessed -> trust the raw cascade more
        forced = parse_mode != "normal"
        if (forced or _degraded(subject) or _degraded(body_text)
                or _degraded(sender_email) or (body_html and "\ufffd" in body_html)):
            raw = _raw_mapi_strings(msg_path)
            if raw["subject"] and (forced or _degraded(subject)):
                subject = raw["subject"]
            if raw["normalized_subject"] and (forced or _degraded(normalized_subject)):
                normalized_subject = raw["normalized_subject"]
            if raw["body_text"] and (forced or _degraded(body_text)):
                body_text = raw["body_text"]
            if raw["body_html"] and (forced or not body_html or "\ufffd" in body_html):
                bh = raw["body_html"]
                body_html = bh if len(bh) <= 2 * 1024 * 1024 else bh[:2 * 1024 * 1024]
            if (raw["sender_smtp"] or raw["sender_email"]) and (forced or _degraded(sender_email)):
                sender_email = raw["sender_smtp"] or raw["sender_email"]
            if raw["sender_name"] and (forced or _degraded(sender_name)):
                sender_name = raw["sender_name"]
            if raw["sender_smtp"] and not sender_smtp:
                sender_smtp = raw["sender_smtp"]
|
||||
|
||||
        # ── Document ──────────────────────────────────────────────────
        return {
            "_id": mid,
            "filename": msg_path.name,

            "subject": subject,
            "normalized_subject": normalized_subject,
            "importance": importance,
            "sensitivity": sensitivity,
            "flag_status": flag_status,
            "read_receipt_requested": read_receipt,
            "delivery_receipt_requested": delivery_receipt,
            "has_attachments": len(attachments) > 0,
            "attachment_count": len(attachments),
            "message_size_bytes": msg_path.stat().st_size,

            "conversation_topic": conversation_topic,
            "conversation_index": conversation_index,
            "in_reply_to": in_reply_to,
            "internet_references": internet_refs,
            "categories": categories,

            "received_at": received_at,
            "sent_at": sent_at,

            "sender": {
                "email": sender_email,
                "name": sender_name,
                "smtp": sender_smtp,
            },
            "to": to_raw,
            "cc": cc_raw,
            "bcc": bcc_raw,
            "display_to": display_to,
            "display_cc": display_cc,
            "recipients": recipients,

            "body_text": body_text,
            "body_html": body_html,

            "attachments": attachments,
            "headers": headers,
            "mapi": mapi_raw,

            "parse_mode": parse_mode,  # normal / suppress_all / override:ENC
            "parse_degraded": parse_degraded,  # True = fallback used (broken attachment/encoding)

            "parsed_at": datetime.now(timezone.utc).replace(tzinfo=None),
        }

    except Exception as e:
        logging.error("extract_message failed [%s]: %s", msg_path.name, e)
        return None
|
||||
|
||||
|
||||
# ─── MongoDB indexes ──────────────────────────────────────────────────────────

def create_indexes(col):
    print("  Vytvarim indexy...")
    col.create_index([("received_at", ASCENDING)])
    col.create_index([("sent_at", ASCENDING)])
    col.create_index([("sender.email", ASCENDING)])
    col.create_index([("filename", ASCENDING)], unique=True, sparse=True)
    col.create_index([("conversation_topic", ASCENDING)])
    col.create_index([("has_attachments", ASCENDING)])
    col.create_index([("categories", ASCENDING)])
    col.create_index([("importance", ASCENDING)])
    col.create_index([("flag_status", ASCENDING)])
    col.create_index([
        ("subject", TEXT),
        ("body_text", TEXT),
        ("to", TEXT),
        ("cc", TEXT),
    ], name="text_search", default_language="none")
    print("  Indexy hotovy.")
|
||||
|
||||
|
||||
# ─── MAIN ─────────────────────────────────────────────────────────────────────

def main():
    ap = argparse.ArgumentParser(description=f"parse_emails v{SCRIPT_VERSION}")
    ap.add_argument("--msgs-dir", default=str(MSGS_DIR),
                    help="Cesta k .msg souborum")
    ap.add_argument("--limit", type=int, default=0,
                    help="Zpracovat max N souboru (0 = vse)")
    ap.add_argument("--skip-existing", action="store_true",
                    help="Preskocit soubory ktere jiz jsou v MongoDB (pokracovani)")
    ap.add_argument("--no-indexes", action="store_true",
                    help="Nevytvorit indexy na konci")
    args = ap.parse_args()

    msgs_dir = Path(args.msgs_dir)
    start = datetime.now()

    print(f"=== parse_emails v{SCRIPT_VERSION} ===")
    print(f"Start:   {start.strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"Zdroj:   {msgs_dir}")
    print(f"MongoDB: {MONGO_URI} -> {MONGO_DB}.{MONGO_COL}")
|
||||
|
||||
    # MongoDB
    client = MongoClient(MONGO_URI, serverSelectionTimeoutMS=5000)
    try:
        client.admin.command("ping")
        print("  MongoDB OK")
    except Exception as e:
        print(f"  CHYBA: MongoDB neni dostupna -- {e}")
        sys.exit(1)

    col = client[MONGO_DB][MONGO_COL]

    # Skip existing: load the list of files that were already imported
    existing: set = set()
    if args.skip_existing:
        print("  Nacitam existujici zaznamy z MongoDB...")
        existing = set(col.distinct("filename"))
        print(f"  {len(existing)} jiz importovano")

    # Scan
    print(f"\nSkenuji {msgs_dir} ...")
    all_files = sorted(msgs_dir.glob("*.msg"))
    if args.limit:
        all_files = all_files[:args.limit]

    to_process = [f for f in all_files if f.name not in existing]
    skipped = len(all_files) - len(to_process)
    total = len(to_process)

    print(f"  Celkem .msg:     {len(all_files)}")
    print(f"  Preskoceno:      {skipped}")
    print(f"  Ke zpracovani:   {total}\n")

    if total == 0:
        print("Neni co importovat.")
        client.close()
        return
|
||||
|
||||
    batch = []
    ok_count = 0
    err_count = 0

    def flush():
        nonlocal ok_count, err_count
        if not batch:
            return
        try:
            col.bulk_write(batch, ordered=False)
        except Exception as e:
            # The whole batch failed (typically because of one bad document).
            # Retry it one document at a time so that the error drops only
            # the truly bad record, not all BATCH_SIZE of them.
            logging.error("bulk_write spadl (%s) -- prepinam na per-dokument", e)
            print(f"  CHYBA bulk_write: {e} -- zkousim per-dokument")
            for op in batch:
                try:
                    col.bulk_write([op], ordered=False)
                except Exception as e2:
                    try:
                        bad_id = getattr(op, "_filter", {}).get("_id", "?")
                    except Exception:
                        bad_id = "?"
                    logging.error("per-dokument selhal [_id=%s]: %s", bad_id, e2)
                    print(f"  ZAHOZEN _id={bad_id}: {e2}")
                    ok_count -= 1
                    err_count += 1
        batch.clear()
|
||||
|
||||
    for i, msg_path in enumerate(to_process, 1):
        doc = extract_message(msg_path)

        if doc is None:
            err_count += 1
        else:
            batch.append(UpdateOne({"_id": doc["_id"]}, {"$set": doc}, upsert=True))
            ok_count += 1

        if len(batch) >= BATCH_SIZE:
            flush()

        # Print one line per email
        status = "ERR " if doc is None else "OK  "
        subject_str = (doc.get("subject") or "")[:60] if doc else "?"
        sender_str = (doc.get("sender", {}).get("email") or "")[:40] if doc else "?"
        print(f"  {i:>6}/{total}  {status}  {subject_str:<60}  {sender_str}")

        if i % 500 == 0:
            elapsed = (datetime.now() - start).total_seconds()
            rate = i / elapsed if elapsed > 0 else 0
            eta_s = int((total - i) / rate) if rate > 0 else 0
            print(f"  {'─'*80}")
            print(f"  Průběh: ok={ok_count}  err={err_count}  "
                  f"{rate:.1f} msg/s  ETA {eta_s//3600}h{(eta_s%3600)//60}m")
            print(f"  {'─'*80}")
|
||||
|
||||
    flush()

    elapsed_total = (datetime.now() - start).total_seconds()
    print(f"\n{'='*52}")
    print(f"Vysledek: ok={ok_count} | skip={skipped} | err={err_count}")
    print(f"Celkovy cas: {int(elapsed_total//3600)}h {int((elapsed_total%3600)//60)}m {int(elapsed_total%60)}s")
    print(f"Dokumentu v kolekci: {col.count_documents({})}")

    if not args.no_indexes:
        print()
        create_indexes(col)

    print(f"\nKonec: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    if err_count:
        print(f"Chyby logovany do: {LOG_FILE}")

    client.close()


if __name__ == "__main__":
    main()
|
||||
@@ -0,0 +1,80 @@
|
||||
# jnj_tower_ingest v1.1.0
|
||||
|
||||
**File:** `jnj_tower_ingest_v1.1.py`
**Date:** 2026-06-10
**Author:** vladimir.buzalka
**Runs in:** Docker container `python-runner` on Unraid Tower (192.168.1.76), next to MongoDB.
|
||||
|
||||
## What it is

Unified **Tower-side ingest** of JNJ e-mails: three formerly separate parts in a single run:

| Phase | Formerly standalone | What it does |
|---|---|---|
| **1. PARSE** | `parse_emails_tower_v1.3.py` | `.msg` from `/mnt/JNJEMAILS` → document in Mongo `emaily."vbuzalka@its.jnj.com"` (body, attachments, headers, MAPI). Incremental via an **mtime watermark** (`jnj_sync_state`/`_id="parse_state"`). |
| **2. SYNC** | `sync_jnj_state_v1.0.py` | newest SQLite (read-only) → mirror `jnj_messages` + fills in `jnj_folder` and state in `emaily`. Watermark `updated_at` + `last_db` shortcut. |
| **3. ENRICH** | `jnj_emails_to_fulltext_v1.0.py` | indexes the JNJ mailbox into the **PG full-text index** by calling the **shared** `5_enrich_fulltext_emails_vX.Y.py --mailbox vbuzalka@its.jnj.com` (same extractor as the Graph pipeline → consistent schema). |

**Order: parse → sync → enrich.** A freshly parsed mail gets its body (parse), path (sync),
and full text (enrich) in a single run. The key everywhere is the Internet Message-ID = Mongo `_id`.
|
||||
|
||||
## Incrementality (cron every 5 min)

- **PARSE**: only `.msg` files with `mtime > parse_state.last_parse_mtime`. The first run seeds from
  the filenames already in Mongo, then relies purely on mtime. `--full` reparses everything. Indexes are built only on full/seed/`--reindex`.
- **SYNC**: watermark `updated_at` + `last_db` shortcut (same SQLite → no-op).
- **ENRICH**: runs **only when parse added new documents** (otherwise it is skipped, since the main
  Graph pipeline enriches JNJ at 6:00/18:00 anyway). The enrich version is **auto-detected**
  (newest `/scripts/5_enrich_fulltext_emails_v*.py`). `--no-enrich` disables it,
  `--enrich-always` forces it.
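
The mtime watermark used by PARSE can be illustrated with a minimal sketch. The function name `select_new_msgs` and its signature are hypothetical and not taken from `jnj_tower_ingest_v1.1.py`; the sketch only assumes that the watermark is a float timestamp comparable to `st_mtime`.

```python
from pathlib import Path


def select_new_msgs(msgs_dir: Path, last_parse_mtime: float) -> tuple[list[Path], float]:
    """Return .msg files newer than the watermark, plus the advanced watermark.

    Hypothetical helper: the real script keeps the watermark in Mongo
    (jnj_sync_state / _id="parse_state"); here it is a plain float argument.
    """
    new_files = []
    new_watermark = last_parse_mtime
    for path in sorted(msgs_dir.glob("*.msg")):
        mtime = path.stat().st_mtime
        if mtime > last_parse_mtime:
            new_files.append(path)
            # The watermark only ever moves forward.
            new_watermark = max(new_watermark, mtime)
    return new_files, new_watermark
```

Calling the function again with the returned watermark yields an empty list until a newer `.msg` appears, which is what makes a 5-minute cron run a cheap no-op.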
|
||||
|
||||
Three independent events (new `.msg` / new `.db` / new docs for PG) → the script does only the
work that is pending; otherwise it is a cheap no-op.
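
The auto-detection of the newest enrich script could look roughly like the sketch below. How the real script defines "newest" is not stated here; this sketch assumes a numeric comparison of the `X.Y` version in the filename, and `newest_enrich_script` is a hypothetical name.

```python
import re
from pathlib import Path
from typing import Optional

# Assumed filename convention: 5_enrich_fulltext_emails_v<major>.<minor>.py
_VERSION_RE = re.compile(r"_v(\d+)\.(\d+)\.py$")


def newest_enrich_script(scripts_dir: Path) -> Optional[Path]:
    """Pick the enrich script with the highest (major, minor) version, if any."""
    best = None
    best_version = (-1, -1)
    for path in scripts_dir.glob("5_enrich_fulltext_emails_v*.py"):
        m = _VERSION_RE.search(path.name)
        if not m:
            continue  # ignore files that do not follow the assumed convention
        version = (int(m.group(1)), int(m.group(2)))
        if version > best_version:
            best, best_version = path, version
    return best
```

Comparing integer tuples rather than strings matters once a minor version reaches two digits: `v1.10` sorts after `v1.9` numerically but before it lexicographically.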
|
||||
|
||||
## Relationship to the Graph pipeline

The main `0_run_pipeline` (Graph API) processes the buzalka.cz mailboxes and **skips JNJ**
(`SKIP_MAILBOXES`, no API). JNJ is handled by this script via `.msg`. Both paths end in the same
Mongo `emaily` and, through the **shared `5_enrich`**, in the same PG `MongoEmaily.emails`. The service
collections `jnj_messages` + `jnj_sync_state` are listed in enrich's `NON_MAILBOX_COLLECTIONS`
(they are not mailboxes → they do not go to PG).
|
||||
|
||||
## Arguments

| Argument | Meaning |
|---|---|
| `--dry-run` | writes nothing, only prints the plan for all phases |
| `--full` | parse: reparse everything; sync: ignore the watermark; enrich: force |
| `--limit N` | at most N files (parse) / rows (sync) |
| `--reindex` | forces index creation after parse |
| `--force` | sync: ignore `last_db` |
| `--parse-only` / `--sync-only` / `--enrich-only` | run only the given phase |
| `--no-enrich` | skip enrich |
| `--enrich-always` | run enrich even without new documents |
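
As a rough illustration of how these flags might combine, here is an `argparse` sketch. The flag names come from the table above; the defaults, the mutual exclusivity of the `--*-only` flags, the precedence of `--no-enrich`, and the helper `should_enrich` are assumptions, not the script's confirmed behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    ap = argparse.ArgumentParser(description="jnj_tower_ingest (sketch)")
    ap.add_argument("--dry-run", action="store_true")
    ap.add_argument("--full", action="store_true")
    ap.add_argument("--limit", type=int, default=0)  # assumed: 0 = no limit
    ap.add_argument("--reindex", action="store_true")
    ap.add_argument("--force", action="store_true")
    # Assumption: at most one of the phase selectors may be given.
    only = ap.add_mutually_exclusive_group()
    only.add_argument("--parse-only", action="store_true")
    only.add_argument("--sync-only", action="store_true")
    only.add_argument("--enrich-only", action="store_true")
    ap.add_argument("--no-enrich", action="store_true")
    ap.add_argument("--enrich-always", action="store_true")
    return ap


def should_enrich(args: argparse.Namespace, new_docs: int) -> bool:
    """Decide whether the ENRICH phase runs (assumed precedence: opt-outs win)."""
    if args.no_enrich or args.parse_only or args.sync_only:
        return False
    return bool(args.enrich_only or args.enrich_always or args.full or new_docs > 0)
```

With no flags, `should_enrich` is true only when parse reported new documents, which matches the default behavior described for the ENRICH phase.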
|
||||
|
||||
## Running

```bash
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.1.py --dry-run
docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.1.py            # cron
docker exec -it python-runner python3 /scripts/jnj_tower_ingest_v1.1.py --enrich-only
```
|
||||
|
||||
## Scheduling (DONE)

Unraid User Scripts job `jnj_state_sync` (cron `*/5 * * * *`): a wrapper with `flock` calls
`docker exec python-runner python3 /scripts/jnj_tower_ingest_v1.1.py`. It logs only real
work/errors to `/mnt/user/Scripts/logs/jnj_tower_ingest.log`
(grep `Zapisuji|PARSE hotovo|SYNC hotovo|ENRICH hotovo|CHYBA|Traceback`).
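
For reference, the same "real work or errors only" filter can be expressed in Python. The pattern is copied from the grep expression above; the function name `filter_log_lines` is hypothetical, and the actual wrapper uses `grep`, not this code.

```python
import re
from typing import Iterable, Iterator

# Same alternation as the grep pattern used by the wrapper.
REAL_WORK = re.compile(r"Zapisuji|PARSE hotovo|SYNC hotovo|ENRICH hotovo|CHYBA|Traceback")


def filter_log_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the lines that report real work or an error."""
    for line in lines:
        if REAL_WORK.search(line):
            yield line
```

Keeping no-op runs out of the log matters at this cadence: a job firing every 5 minutes would otherwise write 288 routine entries a day.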
|
||||
|
||||
## Revert

`jnj_tower_ingest_v1.0.py` (without enrich) + `parse_emails_tower_v1.3.py` +
`sync_jnj_state_v1.0.py` remain in `/scripts/` as a safety net. To roll back, change the wrapper
back. `jnj_emails_to_fulltext` was moved to Trash (replaced by phase 3).
|
||||
|
||||
## Version history

- **1.0.0** 2026-06-10: unified parse + sync (mtime watermark, order parse→sync).
- **1.1.0** 2026-06-10: added the ENRICH phase (shared `5_enrich --mailbox`, version auto-detection,
  runs only on new documents). Replaces `jnj_emails_to_fulltext_v1.0`.
|
||||
File diff suppressed because it is too large
@@ -0,0 +1,55 @@
|
||||
# -*- coding: utf-8 -*-
# =============================================================================
# Name:        fix_email_podruhe_v1.0.py
# Version:     1.0
# Date:        2026-06-10
# Description: For sites in KROK 1 whose STATUS contains "Email odeslán podruhé",
#              replaces that text with "1. připomínka odeslaná" (i.e. the 2nd email
#              was in fact the 1st reminder). After writing, run classify_krok --apply
#              (the sites then move to KROK 2). Idempotent.
# Usage:       python fix_email_podruhe_v1.0.py            (dry-run)
#              python fix_email_podruhe_v1.0.py --apply    (writes)
# =============================================================================
|
||||
|
||||
import os
|
||||
import sys
|
||||
from pymongo import MongoClient
|
||||
|
||||
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://192.168.1.76:27017")
|
||||
OLD = "Email odeslán podruhé"
|
||||
NEW = "1. připomínka odeslaná"
|
||||
|
||||
|
||||
def main():
    apply = "--apply" in sys.argv
    client = MongoClient(MONGO_URI)
    col = client["feasibility"]["investigators"]

    docs = list(col.find(
        {"KROK": {"$regex": "^1"}, "STATUS": {"$regex": "odeslán podruhé"}},
        {"prijmeni": 1, "jmeno": 1, "STATUS": 1},
    ))
    print(f"Nalezeno {len(docs)} center v KROK 1 s '{OLD}'.\n")

    n = 0
    for d in docs:
        status = d.get("STATUS", "") or ""
        new_status = status.replace(OLD, NEW)
        if new_status == status:
            print(f"[SKIP] {d.get('prijmeni')} {d.get('jmeno')}: text nenalezen")
            continue
        print(f"[OK] {d.get('prijmeni')} {d.get('jmeno')}:")
        print(f"       '{status.splitlines()[0]}'  ->  '{new_status.splitlines()[0]}'")
        if apply:
            res = col.update_one({"_id": d["_id"]}, {"$set": {"STATUS": new_status}})
            n += res.modified_count

    print()
    if apply:
        print(f">>> ZAPSANO: {n} zaznamu. Ted spust classify_krok_v1.0.py --apply")
    else:
        print(">>> DRY-RUN. Pro zapis spust s --apply")


if __name__ == "__main__":
    main()
|
||||
@@ -0,0 +1,39 @@
|
||||
<!--
|
||||
=============================================================================
|
||||
  Name:        sipiq_email_template_v1.0.html
  Version:     1.0
  Date:        2026-06-10
  Description: Approved template for the SIPIQ feasibility e-mail (study 77242113UCO3002 / DAWN).
               Used via MCP vbcz-email create_draft_eml.

  Placeholders (replace before generating the draft):
    {{LINK}}     - the site's unique SIPIQ Qualtrics link (from Trilium note "SIPIQ", noteId hAMNUnUQdCRn)
                   CAUTION: inside <a href="..."> every & must be written as &amp;
    {{DEADLINE}} - completion deadline, format DD-MON-YYYY (e.g. 17-JUN-2026); rule = send date + 7 days

  Fixed create_draft_eml parameters:
    to            = the physician's address (verify against real correspondence in the JNJ mailbox vbuzalka@its.jnj.com)
    cc            = AKocourk@ITS.JNJ.com, EBartoso@its.jnj.com
    subject       = 77242113UCO3002/Feasibility dotaznik
    add_signature = false (the signature is included in the body below)
    from_addr     = default (vbuzalka@its.jnj.com; filled in automatically on the JNJ PC)
    output_dir    = u:\Dropbox\!!!Days\Downloads Z230\UploadToJNJ
    filename      = sipiq_<prijmeni>_<DDMONYYYY>.eml

  After sending -> write to Mongo feasibility.investigators (per _id):
    KROK   = "6 - SIPIQ odeslan"
    sipiq.link, sipiq.link_token (the Q_DL part), sipiq.link_stored_at, sipiq.link_source="Trilium SIPIQ note"
    STATUS prepend: "<DDMONYYYY>: SIPIQ odeslan (deadline {{DEADLINE}}; <adresa>)"

  Special rule: Stepek -> send to BOTH of his e-mail addresses.
|
||||
=============================================================================
|
||||
-->
|
||||
<p>Dobrý den,</p>
|
||||
<p>ve společnosti Johnson & Johnson posuzujeme centra zvažovaná pro studie rané fáze vývoje. Prvním krokem je vyplnění dotazníku SIPIQ (Site Interest Protocol Information Questionnaire), díky kterému lépe porozumíme postupům, zásadám a možnostem vašeho centra.</p>
|
||||
<p>Níže najdete odkaz na dotazník SIPIQ specifický pro Vaše centrum. Vyplněný dotazník prosím odešlete do <b>{{DEADLINE}}</b>.</p>
|
||||
<p>Odkaz: <a href="{{LINK}}">{{LINK}}</a></p>
|
||||
<p>Moc prosím vyplňte formulář pečlivě, neuvádějte ani příliš optimistická, ani příliš pesimistická čísla. Na konci dotazníku jsou dotazy na etickou komisi — tyto s přehledem ignorujte, protože situace stran etické komise je nám jasná; vše se podává v rámci centralizovaného EU podání, jehož součástí je i centrální etická komise příslušné země.</p>
|
||||
<p>Naopak nás velice zajímá dotaz ke konci, jak dlouho odhadujete, že bude trvat vyjednávání smlouvy — uveďte to prosím na základě svých zkušeností z předchozích studií.</p>
|
||||
<p>Po vyplnění bude následovat hodnoticí návštěva v centru a finální rozhodnutí o výběru centra.</p>
|
||||
<p>V případě dotazů se na nás neváhejte obrátit.</p>
|
||||
<p>S pozdravem,</p>
|
||||
<p>MUDr. Vladimír BUZALKA<br>ICON plc<br>Performing Local Trial Management Services for Janssen – Cilag s.r.o.<br>Global Clinical Operations<br>Mobile: +420 775 735 276<br>Fax: +420 227 012 284<br>E-mail: vbuzalka@its.jnj.com, vladimir.buzalka@iconplc.com</p>
|
||||
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,649 @@
|
||||
import os
|
||||
import sys
|
||||
import pandas as pd
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
from openpyxl import load_workbook
|
||||
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
|
||||
from openpyxl.utils import get_column_letter
|
||||
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
from common.mongo_writer import get_db
|
||||
|
||||
STUDIES = ["77242113UCO3001", "42847922MDD3003"]
|
||||
|
||||
BASE_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
|
||||
OUTPUT_DIR = BASE_DIR / "output"
|
||||
|
||||
DATE_COLUMNS = {
|
||||
"Orig Exp Date", "Exp Date", "Rcv Date",
|
||||
"Date Asgn", "Disp Date", "Date Ret", "Destroyed", "Max Visit Date",
|
||||
"Visit Date", "Scheduled Date",
|
||||
}
|
||||
|
||||
N_SHIP_COLS = 9  # number of shipment columns that precede the detail columns
|
||||
|
||||
|
||||
# ── Loading data from MongoDB ─────────────────────────────────────────────────
|
||||
|
||||
INVENTORY_COLS = [
|
||||
("site", "Site"),
|
||||
("medication_id", "Med ID"),
|
||||
("packaged_lot_no", "Lot No."),
|
||||
("original_expiration_date", "Orig Exp Date"),
|
||||
("expiration_date", "Exp Date"),
|
||||
("received_date", "Rcv Date"),
|
||||
("receipt_user", "Rcpt User"),
|
||||
("subject_identifier", "Subject ID"),
|
||||
("quantity_assigned", "Qty Asgn"),
|
||||
("irt_transaction", "IRT Tx"),
|
||||
("date_assigned", "Date Asgn"),
|
||||
("assignment_user", "Asgn User"),
|
||||
("dispensation_status", "Disp Status"),
|
||||
("dispensing_date", "Disp Date"),
|
||||
("quantity_dispensed", "Qty Disp"),
|
||||
("dispensing_user", "Disp User"),
|
||||
("quantity_returned", "Qty Ret"),
|
||||
("date_returned", "Date Ret"),
|
||||
("return_user", "Ret User"),
|
||||
]
|
||||
|
||||
|
||||
def load_inventory(study):
    db = get_db()
    inv = list(db.iwrs_inventory.find({"study": study}))
    destr = list(db.iwrs_destruction.find({"study": study}))

    # Map medication_id -> (basket_id, destruction_date); the first record wins.
    destr_map = {}
    for d in destr:
        mid = d.get("medication_id")
        if mid and mid not in destr_map:
            destr_map[mid] = (d.get("basket_id"), d.get("destruction_date"))

    records = []
    for doc in inv:
        row = {label: doc.get(key) for key, label in INVENTORY_COLS}
        b, dt = destr_map.get(doc.get("medication_id"), (None, None))
        row["Destroyed"] = dt
        row["Basket No."] = b
        records.append(row)

    # Explicit columns, so an empty result still carries the expected column
    # names instead of raising KeyError in the build_* functions.
    columns = [label for _, label in INVENTORY_COLS] + ["Destroyed", "Basket No."]
    df = pd.DataFrame(records, columns=columns)

    # Convert dates before sorting so "Rcv Date" is compared as datetimes.
    for col in DATE_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")

    if not df.empty:
        df = df.sort_values(["Site", "Rcv Date", "Med ID"], na_position="last").reset_index(drop=True)
    print(f" Inventory: {len(df)} kits")
    return df
|
||||
|
||||
|
||||
SHIP_COLS = [
|
||||
("shipment_id", "Shipment ID"),
|
||||
("status", "IRT Shipment Status"),
|
||||
("type", "Type"),
|
||||
("ship_from", "Shipment From"),
|
||||
("ship_to_site", "Ship To:"),
|
||||
("request_date", "Request Date"),
|
||||
("received_date", "Received Date"),
|
||||
("received_by", "Received by"),
|
||||
("expected_arrival", "Expected Arrival"),
|
||||
]
|
||||
|
||||
ITEM_COLS = [
|
||||
("investigator", "Investigator"),
|
||||
("medication_description", "Medication Description"),
|
||||
("medication_id", "Medication ID"),
|
||||
("packaged_lot_no", "Packaged Lot number"),
|
||||
("expiration_date", "Expiration Date"),
|
||||
("item_status", "Status"),
|
||||
]
|
||||
|
||||
|
||||
def load_shipments(study):
    db = get_db()
    ships = list(db.iwrs_shipments.find({"study": study}))
    items = list(db.iwrs_shipment_items.find({"study": study}))

    # Index items by shipment_id.
    items_by_ship = {}
    for it in items:
        items_by_ship.setdefault(it.get("shipment_id"), []).append(it)

    # One output row per shipment item, prefixed with its shipment's columns.
    records = []
    for s in ships:
        base = {label: s.get(key) for key, label in SHIP_COLS}
        for it in items_by_ship.get(s.get("shipment_id"), []):
            row = dict(base)
            for key, label in ITEM_COLS:
                row[label] = it.get(key)
            records.append(row)

    # Explicit columns, so an empty result still carries the expected column names.
    columns = [label for _, label in SHIP_COLS + ITEM_COLS]
    df = pd.DataFrame(records, columns=columns)

    for col in ("Request Date", "Received Date", "Expiration Date", "Expected Arrival"):
        df[col] = pd.to_datetime(df[col], errors="coerce")

    if not df.empty:
        df = df.sort_values(["Ship To:", "Shipment ID", "Medication ID"], na_position="last").reset_index(drop=True)
    n_ship = df["Shipment ID"].nunique()
    print(f" Shipments: {n_ship} shipments, {len(df)} kits")
    return df
|
||||
|
||||
|
||||
def load_visits(study):
|
||||
db = get_db()
|
||||
cur = db.iwrs_visits.find({
|
||||
"study": study,
|
||||
"visit_type": "Past",
|
||||
"irt_transaction_no": {"$ne": None},
|
||||
})
|
||||
rows = []
|
||||
for v in cur:
|
||||
rows.append({
|
||||
"Subject": v.get("subject"),
|
||||
"Visit Date": v.get("actual_date") or v.get("scheduled_date"),
|
||||
"Scheduled Date": v.get("scheduled_date"),
|
||||
"IRT Tx No": v.get("irt_transaction_no"),
|
||||
"Visit": v.get("irt_transaction_description"),
|
||||
"Medication": v.get("medication_assignment"),
|
||||
"medication_id": v.get("medication_id"),
|
||||
"quantity_assigned": v.get("quantity_assigned"),
|
||||
})
|
||||
df = pd.DataFrame(rows)
|
||||
if df.empty:
|
||||
print(" Visits: 0 rows")
|
||||
return df
|
||||
|
||||
# GROUP BY subject/actual/scheduled/irt_no/desc/medication
|
||||
grouped = (
|
||||
df.groupby(["Subject", "Visit Date", "Scheduled Date", "IRT Tx No", "Visit", "Medication"],
|
||||
dropna=False, as_index=False)
|
||||
.agg(**{
|
||||
"Med IDs": ("medication_id", lambda s: ", ".join(sorted([str(x) for x in s if pd.notna(x)]))),
|
||||
"Qty": ("quantity_assigned", "sum"),
|
||||
})
|
||||
)
|
||||
grouped = grouped.sort_values(["Subject", "Visit Date"]).reset_index(drop=True)
|
||||
for col in ("Visit Date", "Scheduled Date"):
|
||||
if col in grouped.columns:
|
||||
grouped[col] = pd.to_datetime(grouped[col], errors="coerce")
|
||||
if study == "77242113UCO3001":
|
||||
grouped["Visit"] = grouped["Visit"].replace("Subject Number Creation", "Screening")
|
||||
print(f" Visits: {len(grouped)} rows")
|
||||
return grouped
|
||||
|
||||
|
||||
# ── Derived sheets ────────────────────────────────────────────────────────────
|
||||
|
||||
def build_site_summary(shipments_df):
|
||||
STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
|
||||
pivot = shipments_df.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
|
||||
for s in STATUS_COLS:
|
||||
if s not in pivot.columns:
|
||||
pivot[s] = 0
|
||||
pivot = (
|
||||
pivot[STATUS_COLS]
|
||||
.reset_index()
|
||||
.rename(columns={"Ship To:": "Site", "Returned by Subject": "Returned"})
|
||||
.sort_values("Site")
|
||||
.reset_index(drop=True)
|
||||
)
|
||||
pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)
|
||||
print(f" Site Summary: {len(pivot)} sites")
|
||||
return pivot
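
# Illustrative sketch (not part of the original script): the per-site status
# pivot used by build_site_summary, shown on toy data. The helper name and the
# sample rows are hypothetical. Statuses that never occur are added as zero
# columns so the summary always has the same shape.
import pandas as pd


def _example_site_summary():
    items = pd.DataFrame({
        "Ship To:": ["CZ1", "CZ1", "CZ2"],
        "Status": ["Available", "Dispensed", "Available"],
    })
    # Count kits per (site, status), then spread statuses into columns.
    pivot = items.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
    for s in ("Available", "Assigned", "Dispensed"):
        if s not in pivot.columns:
            pivot[s] = 0
    return pivot[["Available", "Assigned", "Dispensed"]]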
|
||||
|
||||
|
||||
def build_expired(df):
|
||||
today = date.today()
|
||||
mask = (
|
||||
df["Basket No."].isna() &
|
||||
df["Subject ID"].isna() &
|
||||
(df["Exp Date"] < pd.Timestamp(today))
|
||||
)
|
||||
filtered = df[mask].copy().reset_index(drop=True)
|
||||
sheet_name = f"Expired as of {today.strftime('%d-%b-%Y')}"
|
||||
print(f" Expired: {len(filtered)}")
|
||||
return filtered, sheet_name
|
||||
|
||||
|
||||
def build_assigned_not_dispensed(df):
|
||||
mask = df["Subject ID"].notna() & df["Disp Date"].isna()
|
||||
filtered = df[mask].copy().reset_index(drop=True)
|
||||
print(f" Assigned not dispensed: {len(filtered)}")
|
||||
return filtered
|
||||
|
||||
|
||||
def build_not_returned(df):
|
||||
no_ret = df[
|
||||
df["Date Ret"].isna() &
|
||||
df["Subject ID"].notna() &
|
||||
(df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
|
||||
].copy()
|
||||
max_asgn = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
|
||||
no_ret = no_ret.join(max_asgn, on="Subject ID")
|
||||
filtered = no_ret[no_ret["Date Asgn"] < no_ret["Max Visit Date"]].copy()
|
||||
filtered = filtered.drop(columns=["Qty Ret", "Date Ret", "Ret User", "Destroyed", "Basket No."])
|
||||
filtered = filtered.reset_index(drop=True)
|
||||
print(f" Not returned: {len(filtered)}")
|
||||
return filtered
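
# Illustrative sketch (not part of the original script): the "not returned"
# rule from build_not_returned on a toy frame. The helper name and the sample
# data are hypothetical. A kit qualifies when it is assigned to a subject, has
# no return date, is not marked NOT DISPENSED, and the same subject has a
# later assignment.
import pandas as pd


def _example_not_returned():
    df = pd.DataFrame({
        "Subject ID": ["S1", "S1", "S2"],
        "Med ID": ["K1", "K2", "K3"],
        "Date Asgn": pd.to_datetime(["2026-01-05", "2026-02-02", "2026-01-10"]),
        "Date Ret": pd.to_datetime([None, None, None]),
        "Disp Status": ["Dispensed", "Dispensed", "Dispensed"],
    })
    open_kits = df[
        df["Date Ret"].isna()
        & df["Subject ID"].notna()
        & (df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
    ].copy()
    # Latest assignment date per subject, joined back onto each open kit.
    latest = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
    open_kits = open_kits.join(latest, on="Subject ID")
    # Only kits assigned before the subject's latest assignment are kept.
    overdue = open_kits[open_kits["Date Asgn"] < open_kits["Max Visit Date"]]
    return overdue["Med ID"].tolist()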
|
||||
|
||||
|
||||
def build_kits_for_destruction(df):
|
||||
mask = (
|
||||
df["Basket No."].isna() &
|
||||
(df["Date Ret"].notna() | (df["Disp Status"].fillna("").str.upper() == "NOT DISPENSED"))
|
||||
)
|
||||
filtered = (
|
||||
df[mask]
|
||||
.copy()
|
||||
.sort_values(["Site", "Date Ret"], ascending=[True, True])
|
||||
.drop(columns=["Destroyed", "Basket No."])
|
||||
.reset_index(drop=True)
|
||||
)
|
||||
print(f" Kits for destruction: {len(filtered)}")
|
||||
return filtered
|
||||
|
||||
|
||||
# ── Formatting ────────────────────────────────────────────────────────────────
|
||||
|
||||
STRIPE_GRAY = PatternFill("solid", start_color="F2F2F2")
|
||||
STRIPE_WHITE = PatternFill("solid", start_color="FFFFFF")
|
||||
|
||||
# patients: styles kept from create_subject_report.py
|
||||
_PAT_HEADER_FILL = PatternFill("solid", start_color="1F4E79")
|
||||
_PAT_HEADER_FONT = Font(name="Arial", bold=True, color="FFFFFF", size=10)
|
||||
_PAT_NORMAL_FONT = Font(name="Arial", size=10)
|
||||
_PAT_BOLD_FONT = Font(name="Arial", bold=True, size=10)
|
||||
_PAT_STRIKE_FONT = Font(name="Arial", size=10, strike=True, color="999999")
|
||||
_PAT_ADOLESC_FONT = Font(name="Arial", bold=True, size=10)
|
||||
_PAT_THIN = Side(style="thin", color="CCCCCC")
|
||||
_PAT_BORDER = Border(left=_PAT_THIN, right=_PAT_THIN, top=_PAT_THIN, bottom=_PAT_THIN)
|
||||
_PAT_EVEN_FILL = PatternFill("solid", start_color="EBF3FB")
|
||||
_PAT_ODD_FILL = PatternFill("solid", start_color="FFFFFF")
|
||||
_PAT_CENTER = Alignment(horizontal="center", vertical="center")
|
||||
_PAT_LEFT = Alignment(horizontal="left", vertical="center")
|
||||
|
||||
|
||||
def _autofit(ws):
|
||||
for col_cells in ws.columns:
|
||||
max_len = 0
|
||||
col_letter = get_column_letter(col_cells[0].column)
|
||||
for cell in col_cells:
|
||||
if cell.value is None:
|
||||
continue
|
||||
# a date is displayed as DD-MMM-YYYY = 11 characters
|
||||
if hasattr(cell.value, "strftime") or cell.number_format == "DD-MMM-YYYY":
|
||||
length = 11
|
||||
else:
|
||||
length = len(str(cell.value))
|
||||
if length > max_len:
|
||||
max_len = length
|
||||
ws.column_dimensions[col_letter].width = min(max_len + 3, 50)
|
||||
|
||||
|
||||
def format_sheet(ws, header_color, highlight_col=None, highlight_color=None):
|
||||
thin = Side(style="thin", color="000000")
|
||||
border = Border(left=thin, right=thin, top=thin, bottom=thin)
|
||||
header_fill = PatternFill("solid", start_color=header_color)
|
||||
header_font = Font(bold=True, color="FFFFFF", name="Arial", size=10)
|
||||
row_font = Font(name="Arial", size=10)
|
||||
hi_fill = PatternFill("solid", start_color=highlight_color) if highlight_color else None
|
||||
|
||||
headers = [cell.value for cell in ws[1]]
|
||||
|
||||
for cell in ws[1]:
|
||||
cell.fill = header_fill
|
||||
cell.font = header_font
|
||||
cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=False)
|
||||
cell.border = border
|
||||
|
||||
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
|
||||
stripe = STRIPE_GRAY if row[0].row % 2 == 0 else STRIPE_WHITE
|
||||
for cell in row:
|
||||
col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
|
||||
cell.font = row_font
|
||||
cell.border = border
|
||||
cell.alignment = Alignment(horizontal="center")
|
||||
if col_name in DATE_COLUMNS:
|
||||
cell.number_format = "DD-MMM-YYYY"
|
||||
if hi_fill and col_name == highlight_col:
|
||||
cell.fill = hi_fill
|
||||
else:
|
||||
cell.fill = stripe
|
||||
|
||||
_autofit(ws)
|
||||
ws.auto_filter.ref = ws.dimensions
|
||||
ws.freeze_panes = "A2"
|
||||
|
||||
|
||||
def format_shipment_sheet(ws, header_color_ship, header_color_detail, n_ship_cols):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    hfont = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    dfont = Font(name="Arial", size=10)
    fill_ship = PatternFill("solid", start_color=header_color_ship)
    fill_detail = PatternFill("solid", start_color=header_color_detail)

    # Header row: shipment columns and detail columns get different fills.
    for cell in ws[1]:
        cell.fill = fill_ship if cell.column <= n_ship_cols else fill_detail
        cell.font = hfont
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        cell.border = border
    ws.row_dimensions[1].height = 30

    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        stripe = STRIPE_GRAY if row[0].row % 2 == 0 else STRIPE_WHITE
        for cell in row:
            cell.font = dfont
            cell.border = border
            cell.alignment = Alignment(horizontal="center", vertical="center")
            cell.fill = stripe
            # datetime and pandas Timestamp are both subclasses of date.
            if isinstance(cell.value, date):
                cell.number_format = "DD-MMM-YYYY"

    _autofit(ws)
    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"
|
||||
|
||||
|
||||
# ── Patients ─────────────────────────────────────────────────────────────────
|
||||
|
||||
def load_patients(study):
|
||||
db = get_db()
|
||||
docs = list(db.iwrs_subject_summary.find({"study": study}))
|
||||
if not docs:
|
||||
raise RuntimeError(f"No patient data in Mongo for {study}")
|
||||
|
||||
base_cols = [
|
||||
("subject", "Subject"),
|
||||
("investigator", "Investigator"),
|
||||
("age", "Subject's age collection"),
|
||||
("cohort_per_irt", "Cohort per IRT"),
|
||||
("irt_subject_status", "IRT Subject Status"),
|
||||
("last_irt_transaction", "Last Recorded IRT Transaction"),
|
||||
("next_irt_transaction", "Next Expected IRT Transaction"),
|
||||
("next_irt_transaction_date_local", "Next Expected IRT Transaction Date [Local]"),
|
||||
]
|
||||
uco_extra = [
|
||||
("rescreened_subject", "Rescreened Subject"),
|
||||
("adt_ir", "ADT-IR"),
|
||||
("three_or_more_advanced_therapies", "3+ Adv. Therapies"),
|
||||
("only_oral_5asa_compounds", "Only 5-ASA"),
|
||||
("ustekinumab", "Ustekinumab"),
|
||||
("isolated_proctitis", "Isolated Proctitis"),
|
||||
]
|
||||
cols = list(base_cols)
|
||||
if study == "77242113UCO3001":
|
||||
cols += uco_extra
|
||||
|
||||
rows = [{label: d.get(key) for key, label in cols} for d in docs]
|
||||
df = pd.DataFrame(rows).sort_values("Subject").reset_index(drop=True)
|
||||
|
||||
if "Next Expected IRT Transaction Date [Local]" in df.columns:
|
||||
df["Next Expected IRT Transaction Date [Local]"] = pd.to_datetime(
|
||||
df["Next Expected IRT Transaction Date [Local]"], errors="coerce"
|
||||
)
|
||||
print(f" Patients: {len(df)} subjects")
|
||||
return df
|
||||
|
||||
|
||||
def _simplify_cohort(val):
|
||||
if pd.isna(val):
|
||||
return ""
|
||||
val = str(val)
|
||||
if "dolescent" in val:
|
||||
return "Adolescent"
|
||||
if val.startswith("Adult"):
|
||||
return "Adult"
|
||||
return val
|
||||
|
||||
|
||||
def _fmt_date(val):
|
||||
if pd.isna(val):
|
||||
return ""
|
||||
if hasattr(val, "strftime"):
|
||||
return val.strftime("%Y-%m-%d")
|
||||
return str(val)[:10]
|
||||
|
||||
|
||||
def _write_prehled(wb, df_raw, study):
|
||||
ws = wb.create_sheet("Přehled", 0)
|
||||
ws.sheet_view.showGridLines = False
|
||||
|
||||
is_uco = (study == "77242113UCO3001")
|
||||
|
||||
if is_uco:
|
||||
display_headers = ["Subject", "Investigator", "Věk", "Cohort",
|
||||
"Rescreened", "ADT-IR", "≥3 Adv.Th.", "5-ASA only",
|
||||
"Uste.", "Isol.Proct.",
|
||||
"Status", "Last IRT", "Next Visit", "Next Date"]
|
||||
col_widths = [14, 22, 6, 12, 11, 8, 11, 10, 8, 12, 14, 12, 12, 13]
|
||||
status_col = 11
|
||||
flag_cols = set(range(5, 11))  # 1-indexed columns holding Yes/No values
|
||||
else:
|
||||
display_headers = ["Subject", "Investigator", "Věk", "Cohort", "Status", "Last IRT", "Next Visit", "Next Date"]
|
||||
col_widths = [14, 22, 6, 12, 14, 12, 12, 13]
|
||||
status_col = 5
|
||||
flag_cols = set()
|
||||
|
||||
last_col = get_column_letter(len(display_headers))
|
||||
ws.merge_cells(f"A1:{last_col}1")
|
||||
title = ws["A1"]
|
||||
title.value = f"Subject Summary — {study} ({date.today().strftime('%d-%b-%Y')})"
|
||||
title.font = Font(name="Arial", bold=True, size=12, color="1F4E79")
|
||||
title.alignment = Alignment(horizontal="left", vertical="center")
|
||||
ws.row_dimensions[1].height = 22
|
||||
|
||||
for c, (h, w) in enumerate(zip(display_headers, col_widths), 1):
|
||||
cell = ws.cell(row=2, column=c, value=h)
|
||||
cell.font = _PAT_HEADER_FONT
|
||||
cell.fill = _PAT_HEADER_FILL
|
||||
cell.alignment = _PAT_CENTER
|
||||
cell.border = _PAT_BORDER
|
||||
ws.column_dimensions[get_column_letter(c)].width = w
|
||||
ws.row_dimensions[2].height = 18
|
||||
|
||||
base = {
|
||||
"Subject": df_raw["Subject"].fillna(""),
|
||||
"Investigator": df_raw["Investigator"].fillna(""),
|
||||
"Věk": df_raw["Subject's age collection"].apply(lambda v: "" if pd.isna(v) else int(v)),
|
||||
"Cohort": df_raw["Cohort per IRT"].apply(_simplify_cohort),
|
||||
}
|
||||
if is_uco:
|
||||
base.update({
|
||||
"Rescreened": df_raw["Rescreened Subject"].fillna(""),
|
||||
"ADT-IR": df_raw["ADT-IR"].fillna(""),
|
||||
"≥3 Adv.Th.": df_raw["3+ Adv. Therapies"].fillna(""),
|
||||
"5-ASA only": df_raw["Only 5-ASA"].fillna(""),
|
||||
"Uste.": df_raw["Ustekinumab"].fillna(""),
|
||||
"Isol.Proct.": df_raw["Isolated Proctitis"].fillna(""),
|
||||
})
|
||||
base.update({
|
||||
"Status": df_raw["IRT Subject Status"].fillna(""),
|
||||
"Last IRT": df_raw["Last Recorded IRT Transaction"].fillna("—"),
|
||||
"Next Visit": df_raw["Next Expected IRT Transaction"].fillna("—"),
|
||||
"Next Date": df_raw["Next Expected IRT Transaction Date [Local]"].apply(_fmt_date),
|
||||
})
|
||||
display = pd.DataFrame(base).sort_values("Subject").reset_index(drop=True)
|
||||
|
||||
for r_idx, row in display.iterrows():
|
||||
excel_row = r_idx + 3
|
||||
status = str(row["Status"])
|
||||
is_failed = "Screen Failed" in status or "Discontinued" in status
|
||||
is_randomized = "Randomized" in status
|
||||
is_adolescent = row["Cohort"] == "Adolescent"
|
||||
fill = _PAT_EVEN_FILL if r_idx % 2 == 0 else _PAT_ODD_FILL
|
||||
|
||||
for c_idx, val in enumerate(row, 1):
|
||||
cell = ws.cell(row=excel_row, column=c_idx, value=val if val != "" else None)
|
||||
cell.fill = fill
|
||||
cell.border = _PAT_BORDER
|
||||
cell.alignment = _PAT_CENTER if (c_idx == 3 or c_idx in flag_cols) else _PAT_LEFT
|
||||
if is_failed:
|
||||
cell.font = _PAT_STRIKE_FONT
|
||||
elif c_idx == status_col and is_randomized:
|
||||
cell.font = _PAT_BOLD_FONT
|
||||
elif c_idx == 4 and is_adolescent:
|
||||
cell.font = _PAT_ADOLESC_FONT
|
||||
else:
|
||||
cell.font = _PAT_NORMAL_FONT
|
||||
ws.row_dimensions[excel_row].height = 16
|
||||
|
||||
ws.freeze_panes = "A3"
|
||||
ws.auto_filter.ref = f"A2:{last_col}{len(display) + 2}"
|
||||
|
||||
|
||||
def _write_next_visits(wb, df_raw, study, visits_df=None):
|
||||
ws = wb.create_sheet("Next Visits", 1)
|
||||
ws.sheet_view.showGridLines = False
|
||||
|
||||
ws.merge_cells("A1:D1")
|
||||
title = ws["A1"]
|
||||
title.value = f"Next Expected Visits — {study} ({date.today().strftime('%d-%b-%Y')})"
|
||||
title.font = Font(name="Arial", bold=True, size=12, color="1F4E79")
|
||||
title.alignment = Alignment(horizontal="left", vertical="center")
|
||||
ws.row_dimensions[1].height = 22
|
||||
|
||||
nv_headers = ["Subject", "Investigator", "Next Visit", "Datum"]
|
||||
nv_widths = [14, 22, 26, 13]
|
||||
for c, (h, w) in enumerate(zip(nv_headers, nv_widths), 1):
|
||||
cell = ws.cell(row=2, column=c, value=h)
|
||||
cell.font = _PAT_HEADER_FONT
|
||||
cell.fill = _PAT_HEADER_FILL
|
||||
cell.alignment = _PAT_CENTER
|
||||
cell.border = _PAT_BORDER
|
||||
ws.column_dimensions[get_column_letter(c)].width = w
|
||||
ws.row_dimensions[2].height = 18
|
||||
|
||||
df = pd.DataFrame({
|
||||
"Subject": df_raw["Subject"].fillna(""),
|
||||
"Investigator": df_raw["Investigator"].fillna(""),
|
||||
"Next Visit": df_raw["Next Expected IRT Transaction"].fillna(""),
|
||||
"Datum": df_raw["Next Expected IRT Transaction Date [Local]"],
|
||||
"Status": df_raw["IRT Subject Status"].fillna(""),
|
||||
})
|
||||
|
||||
# I-0: date = screening date + 42 days
|
||||
if visits_df is not None and not visits_df.empty:
|
||||
screen = (
|
||||
visits_df[visits_df["Visit"].str.contains("Screen", case=False, na=False)]
|
||||
.groupby("Subject")["Visit Date"].min()
|
||||
.rename("Screening Date")
|
||||
)
|
||||
df = df.join(screen, on="Subject")
|
||||
mask_i0 = df["Next Visit"].str.contains("I-0", na=False)
|
||||
df.loc[mask_i0, "Datum"] = df.loc[mask_i0, "Screening Date"] + pd.Timedelta(days=42)
|
||||
df = df.drop(columns=["Screening Date"])
|
||||
|
||||
df = df[df["Datum"].notna()]
|
||||
df = df[~df["Status"].str.contains("Screen Failed|Discontinued", na=False)]
|
||||
df = df.sort_values("Datum").reset_index(drop=True)
|
||||
|
||||
for r_idx, row in df.iterrows():
|
||||
excel_row = r_idx + 3
|
||||
fill = _PAT_EVEN_FILL if r_idx % 2 == 0 else _PAT_ODD_FILL
|
||||
datum_val = row["Datum"]
|
||||
datum_str = datum_val.strftime("%Y-%m-%d") if hasattr(datum_val, "strftime") else str(datum_val)[:10]
|
||||
for c_idx, val in enumerate([row["Subject"], row["Investigator"], row["Next Visit"], datum_str], 1):
|
||||
cell = ws.cell(row=excel_row, column=c_idx, value=val if val != "" else None)
|
||||
cell.fill = fill
|
||||
cell.border = _PAT_BORDER
|
||||
cell.font = _PAT_NORMAL_FONT
|
||||
cell.alignment = _PAT_LEFT
|
||||
ws.row_dimensions[excel_row].height = 16
|
||||
|
||||
ws.freeze_panes = "A3"
|
||||
ws.auto_filter.ref = f"A2:D{len(df) + 2}"
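
# Illustrative sketch (not part of the original script): the I-0 rule from
# _write_next_visits on toy data. The helper name and the sample values are
# hypothetical. For subjects whose next visit is I-0, the expected date is the
# earliest screening visit plus 42 days; other rows keep their IRT date.
import pandas as pd


def _example_i0_dates():
    nxt = pd.DataFrame({
        "Subject": ["S1", "S2"],
        "Next Visit": ["I-0", "I-4"],
        "Datum": pd.to_datetime([None, "2026-06-01"]),
    })
    visits = pd.DataFrame({
        "Subject": ["S1", "S1"],
        "Visit": ["Screening", "Screening"],
        "Visit Date": pd.to_datetime(["2026-03-10", "2026-03-03"]),
    })
    # Earliest screening date per subject.
    screen = (
        visits[visits["Visit"].str.contains("Screen", case=False, na=False)]
        .groupby("Subject")["Visit Date"].min()
        .rename("Screening Date")
    )
    nxt = nxt.join(screen, on="Subject")
    mask_i0 = nxt["Next Visit"].str.contains("I-0", na=False)
    nxt.loc[mask_i0, "Datum"] = nxt.loc[mask_i0, "Screening Date"] + pd.Timedelta(days=42)
    return nxt.drop(columns=["Screening Date"])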
|
||||
|
||||
|
||||
# ── One report per study ──────────────────────────────────────────────────────
|
||||
|
||||
def create_study_report(study):
    today = date.today()

    # Versioning: find the highest existing version for today's date.
    # Versions are compared as integers; sorting the filenames as text
    # would rank v10 below v2 and overwrite an existing report.
    pattern = f"{today} {study} CZ IWRS overview v*.xlsx"
    versions = [int(p.stem.rsplit("v", 1)[-1]) for p in OUTPUT_DIR.glob(pattern)]
    version = max(versions, default=0) + 1

    output_file = OUTPUT_DIR / f"{today} {study} CZ IWRS overview v{version}.xlsx"

    print(f"\n[{study}] Loading from MongoDB...")
    df = load_inventory(study)
    shipments_df = load_shipments(study)
    df_patients = load_patients(study)
    visits_df = load_visits(study)

    expired_df, expired_sheet = build_expired(df)
    assigned_df = build_assigned_not_dispensed(df)
    not_returned_df = build_not_returned(df)
    destruction_df = build_kits_for_destruction(df)
    site_summary_df = build_site_summary(shipments_df)

    with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
        df.to_excel(writer, index=False, sheet_name="CountryMedicationOverview")
        expired_df.to_excel(writer, index=False, sheet_name=expired_sheet)
        assigned_df.to_excel(writer, index=False, sheet_name="Assigned not dispensed")
        not_returned_df.to_excel(writer, index=False, sheet_name="Not returned")
        destruction_df.to_excel(writer, index=False, sheet_name="Kits for destruction")
        shipments_df.to_excel(writer, index=False, sheet_name="Shipments")
        site_summary_df.to_excel(writer, index=False, sheet_name="Site Summary")
        visits_df.to_excel(writer, index=False, sheet_name="Patient Visits")

    wb = load_workbook(output_file)

    ws_main = wb["CountryMedicationOverview"]
    format_sheet(ws_main, header_color="1F4E79")
    # Tint the destruction columns green on the main sheet.
    green_fill = PatternFill("solid", start_color="E2EFDA")
    headers_main = [c.value for c in ws_main[1]]
    for row in ws_main.iter_rows(min_row=2, max_row=ws_main.max_row):
        for cell in row:
            col_name = headers_main[cell.column - 1] if cell.column <= len(headers_main) else None
            if col_name in ("Destroyed", "Basket No."):
                cell.fill = green_fill

    format_sheet(wb[expired_sheet], header_color="C00000", highlight_col="Exp Date", highlight_color="FFE0E0")
    format_sheet(wb["Assigned not dispensed"], header_color="833C00", highlight_col="Subject ID", highlight_color="FFF2CC")
    format_sheet(wb["Not returned"], header_color="375623", highlight_col="Max Visit Date", highlight_color="E2EFDA")
    format_sheet(wb["Kits for destruction"], header_color="595959")
    format_shipment_sheet(wb["Shipments"], "1F4E79", "375623", N_SHIP_COLS)
    format_sheet(wb["Site Summary"], header_color="1F4E79")
    format_sheet(wb["Patient Visits"], header_color="1F4E79")

    # Patient sheets (Přehled + Next Visits) are inserted at the front.
    _write_prehled(wb, df_patients, study)
    _write_next_visits(wb, df_patients, study, visits_df)

    # Sheet order: move Patient Visits to the first position.
    wb.move_sheet("Patient Visits", offset=-wb.index(wb["Patient Visits"]))

    wb.save(output_file)
    print(f" Saved: {output_file.name} ({len(df)} rows)")
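
# Illustrative sketch (not part of the original script): picking the next
# report version from existing filenames. The helper name is hypothetical.
# The suffix after the last "v" is parsed as an integer, because sorting the
# names as text would place "v10" before "v2".
from pathlib import Path


def _example_next_version(paths):
    versions = [int(Path(p).stem.rsplit("v", 1)[-1]) for p in paths]
    # With no existing files the first version is 1.
    return max(versions, default=0) + 1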
|
||||
|
||||
|
||||
# ── Main ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
    OUTPUT_DIR.mkdir(exist_ok=True)
    for study in STUDIES:
        try:
            create_study_report(study)
        except Exception as e:
            # One failing study must not stop the remaining reports.
            import traceback
            print(f"\n[{study}] ERROR: {e}")
            traceback.print_exc()
    print("\nDone.")


# Run only when executed as a script, not when imported.
if __name__ == "__main__":
    main()
|
||||
@@ -0,0 +1,253 @@
|
||||
"""
|
||||
Import Drugs data (shipments, shipment_items, inventory, destruction) from XLSX into MongoDB.

Called from IWRS/Drugs/run_all.py after the reports have been downloaded.
|
||||
"""
|
||||
|
||||
import os
|
||||
import sys
|
||||
import re
|
||||
import glob
|
||||
|
||||
import pandas as pd
|
||||
|
||||
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
|
||||
from common.mongo_writer import (
|
||||
to_str, to_int, to_date,
|
||||
ensure_indexes, log_import,
|
||||
bulk_upsert_with_snapshot, bulk_upsert_only,
|
||||
)
|
||||
|
||||
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
|
||||
|
||||
|
||||
# ── XLSX parsers (taken from run_all.py, adapted to produce Mongo documents) ──
|
||||
|
||||
def parse_shipments_report(study):
|
||||
path = os.path.join(BASE_DIR, f"xls_shipments_{study}", f"shipments_report_{study}.xlsx")
|
||||
if not os.path.exists(path):
|
||||
print(f" MISSING: {path}")
|
||||
return []
|
||||
raw = pd.read_excel(path, header=None)
|
||||
header_row = None
|
||||
for i, row in raw.iterrows():
|
||||
if "Shipment ID" in [str(v).strip() for v in row]:
|
||||
header_row = i
|
||||
break
|
||||
if header_row is None:
|
||||
return []
|
||||
df = pd.read_excel(path, header=header_row).dropna(how="all")
|
||||
df = df[df["Location"].astype(str).str.contains("Czech", na=False, case=False)]
|
||||
col = df.columns.tolist()
|
||||
rows = []
|
||||
for _, r in df.iterrows():
|
||||
sid = to_str(r["Shipment ID"])
|
||||
if not sid:
|
||||
continue
|
||||
rows.append({
|
||||
"_id": sid,
|
||||
"shipment_id": sid,
|
||||
"study": study,
|
||||
"status": to_str(r["IRT Shipment Status"]),
|
||||
"type": to_str(r["Type"]),
|
||||
"ship_from": to_str(r["Shipment From"]),
|
||||
"ship_to_site": to_str(r["Ship To:"]),
|
||||
"location": to_str(r["Location"]),
|
||||
"request_date": to_date(r["Request Date"]),
|
||||
"shipped_date": to_date(r["Shipped Date"]),
|
||||
"received_date": to_date(r["Received Date"]) if "Received Date" in col else None,
|
||||
"received_by": to_str(r["Received by"]) if "Received by" in col else None,
|
||||
"delivered_date_utc": to_date(r["Delivered Date [UTC]"]) if "Delivered Date [UTC]" in col else None,
|
||||
"delivery_recipient": to_str(r["Delivery Recipient"]) if "Delivery Recipient" in col else None,
|
||||
"delivery_details": to_str(r["Delivery Details"]) if "Delivery Details" in col else None,
|
||||
"cancelled_date": to_date(r["Cancelled Date"]) if "Cancelled Date" in col else None,
|
||||
"total_medication_ids": to_int(r["Total Medication IDs"]) if "Total Medication IDs" in col else None,
|
||||
"tracking_no": to_str(r["Tracking #"]) if "Tracking #" in col else None,
|
||||
"shipping_category": to_str(r["Shipping Category"]) if "Shipping Category" in col else None,
|
||||
"expected_arrival": to_date(r["Expected Arrival"]) if "Expected Arrival" in col else None,
|
||||
})
|
||||
return rows
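
# Illustrative sketch (not part of the original script): the header-row scan
# that the parsers in this module repeat. The helper name is hypothetical.
# The sheet is read without a header and scanned for the first row that
# contains a known column label; that row index is then used as the header.
import pandas as pd


def _example_find_header_row(raw, marker):
    for i, row in raw.iterrows():
        # Cells are compared as stripped strings, as in the parsers above.
        if marker in [str(v).strip() for v in row]:
            return i
    return None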
|
||||
|
||||
|
||||
def parse_shipment_details(study):
|
||||
detail_dir = os.path.join(BASE_DIR, f"xls_shipment_details_{study}")
|
||||
files = sorted(glob.glob(os.path.join(detail_dir, "shipment_details_*.xlsx")))
|
||||
rows = []
|
||||
for path in files:
|
||||
m = re.search(r"shipment_details_(.+)\.xlsx", os.path.basename(path))
|
||||
shipment_id = m.group(1) if m else "UNKNOWN"
|
||||
raw = pd.read_excel(path, header=None)
|
||||
header_row = None
|
||||
for i, row in raw.iterrows():
|
||||
if "Medication ID" in [str(v).strip() for v in row]:
|
||||
header_row = i
|
||||
break
|
||||
if header_row is None:
|
||||
continue
|
||||
df = pd.read_excel(path, header=header_row).dropna(how="all")
|
||||
for _, r in df.iterrows():
|
||||
med_desc = (to_str(r.get("Medication Description"))
|
||||
or to_str(r.get("Medication ID Description")))
|
||||
med_type = (to_str(r.get("Medication type"))
|
||||
or to_str(r.get("Medication ID type")))
|
||||
med_id = to_str(r.get("Medication ID"))
|
||||
if not med_id:
|
||||
continue
|
||||
rows.append({
|
||||
"_id": f"{shipment_id}:{med_id}",
|
||||
"study": study,
|
||||
"shipment_id": shipment_id,
|
||||
"destination_location": to_str(r.get("Destination Location")),
|
||||
"shipment_status": to_str(r.get("IRT Shipment Status")),
|
||||
"shipment_type": to_str(r.get("Type")),
|
||||
"destination_site": to_str(r.get("Destination Site")),
|
||||
"investigator": to_str(r.get("Investigator")),
|
||||
"medication_description": med_desc,
|
||||
"medication_type": med_type,
|
||||
"medication_id": med_id,
|
||||
"packaged_lot_no": to_str(r.get("Packaged Lot number")),
|
||||
"packaged_lot_description": to_str(r.get("Packaged Lot description")),
|
||||
"container_id": to_str(r.get("Container ID")),
|
||||
"quantity": to_int(r.get("Quantity of Medication IDs")),
|
||||
"expiration_date": to_date(r.get("Expiration Date")),
|
||||
"item_status": to_str(r.get("Status")),
|
||||
})
|
||||
# dedupe (last one wins)
|
||||
by_id = {r["_id"]: r for r in rows}
|
||||
return list(by_id.values())
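
# Illustrative sketch (not part of the original script): the "last one wins"
# dedupe used by the parsers. The helper name is hypothetical. A dict keyed by
# _id keeps one document per key; a later duplicate replaces the earlier
# value but keeps the position of the first occurrence.
def _example_dedupe_last_wins(rows):
    by_id = {r["_id"]: r for r in rows}
    return list(by_id.values())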
|
||||
|
||||
|
||||
def parse_inventory(study):
|
||||
inv_dir = os.path.join(BASE_DIR, f"xls_reports_{study}")
|
||||
files = sorted(glob.glob(os.path.join(inv_dir, "onsite_inventory_detail_*.xlsx")))
|
||||
rows = []
|
||||
for path in files:
|
||||
raw = pd.read_excel(path, header=None)
|
||||
site = investigator = location = None
|
||||
header_row = None
|
||||
for i, row in raw.iterrows():
|
||||
first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
|
||||
if first.startswith("Site:"):
|
||||
site = first.replace("Site:", "").strip()
|
||||
elif first.startswith("Investigator:"):
|
||||
investigator = first.replace("Investigator:", "").strip()
|
||||
elif first.startswith("Location:"):
|
||||
location = first.replace("Location:", "").strip()
|
||||
if first in ("Medication", "Medication ID") and header_row is None:
|
||||
header_row = i
|
||||
if header_row is None:
|
||||
continue
|
||||
df = pd.read_excel(path, header=header_row).dropna(how="all")
|
||||
df = df.rename(columns={df.columns[0]: "medication_id"})
|
||||
for _, r in df.iterrows():
|
||||
med_id = to_str(r["medication_id"])
|
||||
if not med_id or not site:
|
||||
continue
|
||||
rows.append({
|
||||
"_id": f"{site}:{med_id}",
|
||||
"study": study,
|
||||
"site": site,
|
||||
"investigator": investigator,
|
||||
"location": location,
|
||||
"medication_id": med_id,
|
||||
"packaged_lot_no": to_str(r.get("Packaged Lot number")),
|
||||
"original_expiration_date": to_date(r.get("Original Expiration Date when Packaged Lot was Added")),
|
||||
"expiration_date": to_date(r.get("Expiration date")),
|
||||
"received_date": to_date(r.get("Received Date")),
|
||||
"receipt_user": to_str(r.get("Shipment Receipt User")),
|
||||
"subject_identifier": to_str(r.get("Subject Identifier")),
|
||||
"quantity_assigned": to_int(r.get("Quantity Assigned")),
|
||||
"irt_transaction": to_str(r.get("IRT Transaction")),
|
||||
"date_assigned": to_date(r.get("Date Assigned")),
|
||||
"assignment_user": to_str(r.get("Assignment User")),
|
||||
"dispensation_status": to_str(r.get("Dispensation Status")),
|
||||
"dispensing_date": to_date(r.get("Dispensing date") or r.get("Dispensing Date")),
|
||||
"quantity_dispensed": to_int(r.get("Quantity Dispensed")),
|
||||
"dispensing_user": to_str(r.get("Dispensing User")),
|
||||
"quantity_returned": to_int(r.get("Quantity Returned")),
|
||||
"date_returned": to_date(r.get("Date Returned")),
|
||||
"return_user": to_str(r.get("Return User")),
|
||||
})
|
||||
by_id = {r["_id"]: r for r in rows}
|
||||
return list(by_id.values())
|
||||
|
||||
|
||||
def parse_destruction_files(study):
|
||||
dest_dir = os.path.join(BASE_DIR, f"xls_ip_destruction_{study}")
|
||||
files = sorted(glob.glob(os.path.join(dest_dir, "ip_destruction_basket_*.xlsx")))
|
||||
rows = []
|
||||
for path in files:
|
||||
raw = pd.read_excel(path, header=None)
|
||||
meta = {}
|
||||
header_row = None
|
||||
for i, row in raw.iterrows():
|
||||
first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
|
||||
for key, attr in [
|
||||
("Investigator Name:", "investigator"),
|
||||
("Site ID:", "site_id"),
|
||||
("Location:", "location"),
|
||||
("Basket ID:", "basket_id"),
|
||||
("Drug Destruction Created Date:", "destruction_date"),
|
||||
]:
|
||||
if first.startswith(key):
|
||||
meta[attr] = first.replace(key, "").strip()
|
||||
if first == "Medication ID Description" and header_row is None:
|
||||
header_row = i
|
||||
if header_row is None:
|
||||
continue
|
||||
df = pd.read_excel(path, header=header_row).dropna(how="all")
|
||||
basket_id = meta.get("basket_id")
|
||||
for _, r in df.iterrows():
|
||||
med_id = to_str(r.get("Medication ID"))
|
||||
if not med_id or not basket_id:
|
||||
continue
|
||||
rows.append({
|
||||
"_id": f"{basket_id}:{med_id}",
|
||||
"study": study,
|
||||
"site_id": meta.get("site_id"),
|
||||
"investigator": meta.get("investigator"),
|
||||
"location": meta.get("location"),
|
||||
"basket_id": basket_id,
|
||||
"destruction_date": to_date(meta.get("destruction_date")),
|
||||
"medication_description": to_str(r.get("Medication ID Description")),
|
||||
"medication_id": med_id,
|
||||
"packaged_lot_description": to_str(r.get("Packaged Lot description")),
|
||||
"comments": to_str(r.get("Comments")),
|
||||
})
|
||||
by_id = {r["_id"]: r for r in rows}
|
||||
return list(by_id.values())
|
||||
|
||||
|
||||
# ── main import ──────────────────────────────────────────────────────────────
|
||||
|
||||
def import_study(study):
|
||||
    print(f"\n [{study}] parsing XLSX...")
|
||||
shipments = parse_shipments_report(study)
|
||||
items = parse_shipment_details(study)
|
||||
inventory = parse_inventory(study)
|
||||
destruct = parse_destruction_files(study)
|
||||
    print(f" Shipments: {len(shipments)} | Items: {len(items)} | Inventory: {len(inventory)} | Destruction: {len(destruct)}")
|
||||
|
||||
import_id = log_import(study, f"drugs_{study}", "drugs", {
|
||||
"shipments": len(shipments),
|
||||
"shipment_items": len(items),
|
||||
"inventory": len(inventory),
|
||||
"destruction": len(destruct),
|
||||
})
|
||||
print(f" import_id = {import_id}")
|
||||
|
||||
bulk_upsert_with_snapshot("iwrs_shipments", "iwrs_shipments_snapshots", shipments, import_id)
|
||||
bulk_upsert_with_snapshot("iwrs_shipment_items", "iwrs_shipment_items_snapshots", items, import_id)
|
||||
bulk_upsert_with_snapshot("iwrs_inventory", "iwrs_inventory_snapshots", inventory, import_id)
|
||||
bulk_upsert_only("iwrs_destruction", destruct, import_id)
|
||||
|
||||
|
||||
def run(studies):
|
||||
ensure_indexes()
|
||||
for s in studies:
|
||||
import_study(s)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
studies = sys.argv[1:] if len(sys.argv) > 1 else ["77242113UCO3001", "42847922MDD3003"]
|
||||
run(studies)
|
||||
@@ -0,0 +1,245 @@
|
||||
"""
|
||||
Complete pipeline for Drugs:
1. Onsite inventory detail (per site, always overwrites)
2. IP destruction (per basket, skips files that already exist)
3. Shipments report (one file per study, overwrites)
4. Shipment details (per CZ shipment; skips RECEIVED shipments that are already downloaded, otherwise overwrites)
5. Import into MongoDB (studie.iwrs_shipments / iwrs_shipment_items / iwrs_inventory / iwrs_destruction)

Run this script; it processes both studies automatically.
|
||||
"""
|
||||
|
||||
import os
|
||||
import glob
|
||||
import re
|
||||
import datetime
|
||||
|
||||
import sys
|
||||
import pandas as pd
|
||||
from playwright.sync_api import sync_playwright
|
||||
|
||||
import import_to_mongo as drugs_mongo
|
||||
|
||||
BASE_URL = "https://janssen.4gclinical.com"
|
||||
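# Credentials come from the environment; IRT_EMAIL / IRT_PASSWORD are assumed variable names.
EMAIL = os.environ["IRT_EMAIL"]
PASSWORD = os.environ["IRT_PASSWORD"]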
|
||||
|
||||
STUDIES = ["77242113UCO3001", "42847922MDD3003"]
|
||||
|
||||
SITES = {
|
||||
"77242113UCO3001": [
|
||||
"DD5-CZ10001", "DD5-CZ10003", "DD5-CZ10006", "DD5-CZ10009",
|
||||
"DD5-CZ10010", "DD5-CZ10012", "DD5-CZ10013", "DD5-CZ10015",
|
||||
"DD5-CZ10016", "DD5-CZ10020", "DD5-CZ10021", "DD5-CZ10022",
|
||||
],
|
||||
"42847922MDD3003": [
|
||||
"S10-CZ10002", "S10-CZ10004", "S10-CZ10005",
|
||||
"S10-CZ10008", "S10-CZ10011", "S10-CZ10012",
|
||||
],
|
||||
}
|
||||
|
||||
BASE_DIR = os.path.dirname(os.path.abspath(__file__))
|
||||
|
||||
|
||||
|
||||
# ── login ────────────────────────────────────────────────────────────────────
|
||||
|
||||
def login(page, study):
|
||||
page.goto(BASE_URL)
|
||||
page.wait_for_load_state("networkidle")
|
||||
page.get_by_label("Email *").fill(EMAIL)
|
||||
page.get_by_label("Password *").fill(PASSWORD)
|
||||
page.locator("#login__submit").click()
|
||||
page.wait_for_load_state("networkidle")
|
||||
page.get_by_label("Study *").click()
|
||||
page.get_by_role("option", name=study).click()
|
||||
page.get_by_role("button", name="SELECT").click()
|
||||
page.wait_for_load_state("networkidle")
|
||||
|
||||
|
||||
# ── download functions ───────────────────────────────────────────────────────
|
||||
|
||||
def download_inventory(page, study):
|
||||
out_dir = os.path.join(BASE_DIR, f"xls_reports_{study}")
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
page.goto(f"{BASE_URL}/report/onsite_inventory_detail")
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
for site_id in SITES[study]:
|
||||
print(f" [{site_id}] inventory...")
|
||||
page.locator('input[placeholder="search"], input[type="text"]').first.click()
|
||||
page.get_by_role("option", name=site_id).click()
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
filename = os.path.join(out_dir, f"onsite_inventory_detail_{site_id}.xlsx")
|
||||
with page.expect_download(timeout=120000) as dl:
|
||||
page.get_by_role("button", name="Download XLS").click()
|
||||
dl.value.save_as(filename)
|
||||
|
||||
page.get_by_role("button", name="Clear").click()
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
    print(f" Inventory OK ({len(SITES[study])} sites)")
|
||||
|
||||
|
||||
def download_destruction(page, study):
|
||||
out_dir = os.path.join(BASE_DIR, f"xls_ip_destruction_{study}")
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
page.goto(f"{BASE_URL}/report/ip_destruction_form")
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
page.locator('input[placeholder="search"], input[type="text"]').first.click()
|
||||
page.wait_for_timeout(1000)
|
||||
baskets = [b.strip() for b in page.locator("mat-option").all_inner_texts()
|
||||
if b.strip() and b.strip() != "No results found"]
|
||||
page.keyboard.press("Escape")
|
||||
page.wait_for_timeout(500)
|
||||
|
||||
if not baskets:
|
||||
        print(" No destruction baskets")
|
||||
return
|
||||
|
||||
new_count = 0
|
||||
for basket in baskets:
|
||||
filename = os.path.join(out_dir, f"ip_destruction_basket_{basket}.xlsx")
|
||||
if os.path.exists(filename):
|
||||
            continue  # destruction records do not change, so skip
|
||||
        print(f" [basket {basket}] downloading...")
|
||||
input_field = page.locator('input[placeholder="search"], input[type="text"]').first
|
||||
input_field.click()
|
||||
input_field.fill(basket)
|
||||
page.wait_for_timeout(500)
|
||||
page.locator("mat-option").first.dispatch_event("click")
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
with page.expect_download(timeout=120000) as dl:
|
||||
page.get_by_role("button", name="Download XLS").click()
|
||||
dl.value.save_as(filename)
|
||||
new_count += 1
|
||||
|
||||
page.get_by_role("button", name="Clear").click()
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
    print(f" Destruction OK ({new_count} new, {len(baskets) - new_count} skipped)")
|
||||
|
||||
|
||||
def download_shipments_report(page, study):
|
||||
out_dir = os.path.join(BASE_DIR, f"xls_shipments_{study}")
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
page.goto(f"{BASE_URL}/report/shipments_report")
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
filename = os.path.join(out_dir, f"shipments_report_{study}.xlsx")
|
||||
with page.expect_download(timeout=120000) as dl:
|
||||
page.get_by_role("button", name="Download XLS").click()
|
||||
dl.value.save_as(filename)
|
||||
print(f" Shipments report OK")
|
||||
|
||||
|
||||
def download_shipment_details(page, study):
|
||||
out_dir = os.path.join(BASE_DIR, f"xls_shipment_details_{study}")
|
||||
os.makedirs(out_dir, exist_ok=True)
|
||||
|
||||
    # load CZ shipment IDs from the just-downloaded shipments report
|
||||
report_path = os.path.join(BASE_DIR, f"xls_shipments_{study}", f"shipments_report_{study}.xlsx")
|
||||
raw = pd.read_excel(report_path, header=None)
|
||||
header_row = None
|
||||
for i, row in raw.iterrows():
|
||||
if "Shipment ID" in [str(v).strip() for v in row]:
|
||||
header_row = i
|
||||
break
|
||||
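    if header_row is None:
        # Without a header row the column lookups below would fail with an unclear KeyError.
        raise RuntimeError(f"'Shipment ID' header not found in {report_path}")
    df = pd.read_excel(report_path, header=header_row)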
|
||||
df = df.dropna(how="all")
|
||||
df = df[df["Location"].astype(str).str.contains("Czech", na=False, case=False)]
|
||||
cz_shipments = list(zip(
|
||||
df["Shipment ID"].astype(str).str.strip(),
|
||||
df["IRT Shipment Status"].astype(str).str.strip() if "IRT Shipment Status" in df.columns else [""] * len(df),
|
||||
))
|
||||
    print(f" CZ shipments to download: {len(cz_shipments)}")
|
||||
|
||||
page.goto(f"{BASE_URL}/report/shipment_details_report")
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
skipped = 0
|
||||
for shipment, status in cz_shipments:
|
||||
filename = os.path.join(out_dir, f"shipment_details_{shipment}.xlsx")
|
||||
if os.path.exists(filename) and status.upper() == "RECEIVED":
|
||||
skipped += 1
|
||||
            continue  # final state, the file no longer changes
|
||||
input_field = page.locator('input[placeholder="search"], input[type="text"]').first
|
||||
input_field.click()
|
||||
input_field.fill(shipment)
|
||||
page.wait_for_timeout(500)
|
||||
page.locator("mat-option").first.dispatch_event("click")
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
with page.expect_download(timeout=120000) as dl:
|
||||
page.get_by_role("button", name="Download XLS").click()
|
||||
dl.value.save_as(filename)
|
||||
print(f" [{shipment}] ({status}) OK")
|
||||
|
||||
page.get_by_role("button", name="Clear").click()
|
||||
page.wait_for_load_state("networkidle", timeout=120000)
|
||||
|
||||
    print(f" Skipped (RECEIVED): {skipped}")
|
||||
|
||||
|
||||
# ── main ─────────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
os.chdir(BASE_DIR)
|
||||
|
||||
    # ── Download ─────────────────────────────────────────────────────────────
|
||||
with sync_playwright() as p:
|
||||
for study in STUDIES:
|
||||
print(f"\n{'='*60}")
|
||||
            print(f"[{study}] DOWNLOAD")
|
||||
print(f"{'='*60}")
|
||||
|
||||
browser = p.chromium.launch(headless=False)
|
||||
context = browser.new_context(accept_downloads=True)
|
||||
page = context.new_page()
|
||||
|
||||
try:
|
||||
                print(" Logging in...")
|
||||
login(page, study)
|
||||
|
||||
print("\n [1/4] Onsite inventory...")
|
||||
download_inventory(page, study)
|
||||
|
||||
print("\n [2/4] IP destruction...")
|
||||
download_destruction(page, study)
|
||||
|
||||
print("\n [3/4] Shipments report...")
|
||||
download_shipments_report(page, study)
|
||||
|
||||
print("\n [4/4] Shipment details (CZ)...")
|
||||
download_shipment_details(page, study)
|
||||
|
||||
except Exception as e:
|
||||
import traceback
|
||||
                print(f" ERROR during download: {e}")
|
||||
traceback.print_exc()
|
||||
finally:
|
||||
browser.close()
|
||||
|
||||
    # ── Import into MongoDB ───────────────────────────────────────────────────
|
||||
print(f"\n{'='*60}")
|
||||
    print("IMPORT INTO MongoDB")
|
||||
print(f"{'='*60}")
|
||||
|
||||
try:
|
||||
drugs_mongo.run(STUDIES)
|
||||
except Exception as e:
|
||||
import traceback
|
||||
        print(f" ERROR during import: {e}")
|
||||
traceback.print_exc()
|
||||
|
||||
print(f"\n{'='*60}")
|
||||
    print("All done.")
|
||||
print(f"{'='*60}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
    main()
|
||||
@@ -0,0 +1,139 @@
|
||||
import mysql.connector
|
||||
import db_config
|
||||
|
||||
conn = mysql.connector.connect(
|
||||
host=db_config.DB_HOST, port=db_config.DB_PORT,
|
||||
user=db_config.DB_USER, password=db_config.DB_PASSWORD,
|
||||
database=db_config.DB_NAME
|
||||
)
|
||||
c = conn.cursor()
|
||||
|
||||
# Add report_type to iwrs_import (if it does not exist yet)
|
||||
try:
|
||||
c.execute("""ALTER TABLE iwrs_import
|
||||
ADD COLUMN report_type VARCHAR(20) NOT NULL DEFAULT 'patients'
|
||||
AFTER source_file""")
|
||||
    print("ALTER TABLE iwrs_import OK: report_type added")
|
||||
except mysql.connector.errors.DatabaseError as e:
|
||||
if "Duplicate column" in str(e):
|
||||
        print("report_type already exists, skipped")
|
||||
else:
|
||||
raise
|
||||
|
||||
stmts = [
|
||||
(
|
||||
"iwrs_shipments",
|
||||
"""CREATE TABLE IF NOT EXISTS iwrs_shipments (
|
||||
id INT AUTO_INCREMENT PRIMARY KEY,
|
||||
import_id INT NOT NULL,
|
||||
study VARCHAR(20) NOT NULL,
|
||||
shipment_id VARCHAR(20) NOT NULL,
|
||||
status VARCHAR(50),
|
||||
type VARCHAR(30),
|
||||
ship_from VARCHAR(50),
|
||||
ship_to_site VARCHAR(50),
|
||||
location VARCHAR(50),
|
||||
request_date DATE,
|
||||
shipped_date DATE,
|
||||
received_date DATE,
|
||||
received_by VARCHAR(100),
|
||||
delivered_date_utc DATE,
|
||||
delivery_recipient VARCHAR(100),
|
||||
delivery_details VARCHAR(200),
|
||||
cancelled_date DATE,
|
||||
total_medication_ids SMALLINT,
|
||||
tracking_no VARCHAR(100),
|
||||
shipping_category VARCHAR(50),
|
||||
expected_arrival DATE,
|
||||
FOREIGN KEY (import_id) REFERENCES iwrs_import(import_id),
|
||||
INDEX idx_import (import_id),
|
||||
INDEX idx_study_shipment (study, shipment_id)
|
||||
)"""
|
||||
),
|
||||
(
|
||||
"iwrs_shipment_items",
|
||||
"""CREATE TABLE IF NOT EXISTS iwrs_shipment_items (
|
||||
id INT AUTO_INCREMENT PRIMARY KEY,
|
||||
import_id INT NOT NULL,
|
||||
study VARCHAR(20) NOT NULL,
|
||||
shipment_id VARCHAR(20) NOT NULL,
|
||||
destination_location VARCHAR(50),
|
||||
shipment_status VARCHAR(50),
|
||||
shipment_type VARCHAR(30),
|
||||
destination_site VARCHAR(50),
|
||||
investigator VARCHAR(100),
|
||||
medication_description VARCHAR(200),
|
||||
medication_type VARCHAR(50),
|
||||
medication_id VARCHAR(20),
|
||||
packaged_lot_no VARCHAR(50),
|
||||
packaged_lot_description VARCHAR(100),
|
||||
container_id VARCHAR(50),
|
||||
quantity SMALLINT,
|
||||
expiration_date DATE,
|
||||
item_status VARCHAR(50),
|
||||
FOREIGN KEY (import_id) REFERENCES iwrs_import(import_id),
|
||||
INDEX idx_import (import_id),
|
||||
INDEX idx_med_id (medication_id)
|
||||
)"""
|
||||
),
|
||||
(
|
||||
"iwrs_inventory",
|
||||
"""CREATE TABLE IF NOT EXISTS iwrs_inventory (
|
||||
id INT AUTO_INCREMENT PRIMARY KEY,
|
||||
import_id INT NOT NULL,
|
||||
study VARCHAR(20) NOT NULL,
|
||||
site VARCHAR(50),
|
||||
investigator VARCHAR(100),
|
||||
location VARCHAR(50),
|
||||
medication_id VARCHAR(20),
|
||||
packaged_lot_no VARCHAR(50),
|
||||
original_expiration_date DATE,
|
||||
expiration_date DATE,
|
||||
received_date DATE,
|
||||
receipt_user VARCHAR(100),
|
||||
subject_identifier VARCHAR(20),
|
||||
quantity_assigned SMALLINT,
|
||||
irt_transaction VARCHAR(100),
|
||||
date_assigned DATE,
|
||||
assignment_user VARCHAR(100),
|
||||
dispensation_status VARCHAR(50),
|
||||
dispensing_date DATE,
|
||||
quantity_dispensed SMALLINT,
|
||||
dispensing_user VARCHAR(100),
|
||||
quantity_returned SMALLINT,
|
||||
date_returned DATE,
|
||||
return_user VARCHAR(100),
|
||||
FOREIGN KEY (import_id) REFERENCES iwrs_import(import_id),
|
||||
INDEX idx_import (import_id),
|
||||
INDEX idx_site (study, site)
|
||||
)"""
|
||||
),
|
||||
(
|
||||
"iwrs_destruction",
|
||||
"""CREATE TABLE IF NOT EXISTS iwrs_destruction (
|
||||
id INT AUTO_INCREMENT PRIMARY KEY,
|
||||
study VARCHAR(20) NOT NULL,
|
||||
site_id VARCHAR(50),
|
||||
investigator VARCHAR(100),
|
||||
location VARCHAR(50),
|
||||
basket_id VARCHAR(20) NOT NULL,
|
||||
destruction_date DATE,
|
||||
medication_description VARCHAR(200),
|
||||
medication_id VARCHAR(20),
|
||||
packaged_lot_description VARCHAR(100),
|
||||
comments VARCHAR(500),
|
||||
imported_at DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
|
||||
UNIQUE KEY uq_destruction (study, basket_id, medication_id),
|
||||
INDEX idx_study_basket (study, basket_id)
|
||||
)"""
|
||||
),
|
||||
]
|
||||
|
||||
for name, sql in stmts:
|
||||
c.execute(sql)
|
||||
print(f"OK: {name}")
|
||||
|
||||
conn.commit()
|
||||
c.close()
|
||||
conn.close()
|
||||
print("\nAll tables ready.")
|
||||
@@ -0,0 +1,364 @@
|
||||
import sys
|
||||
import os
|
||||
import mysql.connector
|
||||
import pandas as pd
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
from openpyxl import load_workbook
|
||||
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
|
||||
from openpyxl.utils import get_column_letter
|
||||
|
||||
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
|
||||
import db_config
|
||||
|
||||
STUDY = "42847922MDD3003"
|
||||
# STUDY = "77242113UCO3001"
|
||||
|
||||
BASE_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
|
||||
OUTPUT_DIR = BASE_DIR / "output"
|
||||
OUTPUT_FILE = OUTPUT_DIR / f"{date.today().strftime('%Y-%m-%d')} {STUDY} CZ IWRS overview.xlsx"
|
||||
|
||||
DATE_COLUMNS = {
|
||||
"Orig Exp Date", "Exp Date", "Rcv Date",
|
||||
"Date Asgn", "Disp Date", "Date Ret", "Destroyed", "Max Visit Date",
|
||||
}
|
||||
|
||||
COLUMN_WIDTHS = {
|
||||
"Site": 14,
|
||||
"Med ID": 10,
|
||||
"Lot No.": 12,
|
||||
"Orig Exp Date": 16,
|
||||
"Exp Date": 14,
|
||||
"Rcv Date": 14,
|
||||
"Rcpt User": 22,
|
||||
"Subject ID": 14,
|
||||
"Qty Asgn": 9,
|
||||
"IRT Tx": 8,
|
||||
"Date Asgn": 14,
|
||||
"Asgn User": 20,
|
||||
"Disp Status": 16,
|
||||
"Disp Date": 14,
|
||||
"Qty Disp": 9,
|
||||
"Disp User": 20,
|
||||
"Qty Ret": 10,
|
||||
"Date Ret": 14,
|
||||
"Ret User": 18,
|
||||
"Destroyed": 14,
|
||||
"Basket No.": 12,
|
||||
"Max Visit Date": 16,
|
||||
}
|
||||
|
||||
# shipments sheet: number of leading shipment columns; detail columns start after them (1-based, for format_shipment_sheet)
|
||||
N_SHIP_COLS = 9
|
||||
|
||||
|
||||
# ── DB ────────────────────────────────────────────────────────────────────────
|
||||
|
||||
def get_conn():
|
||||
return mysql.connector.connect(
|
||||
host=db_config.DB_HOST, port=db_config.DB_PORT,
|
||||
user=db_config.DB_USER, password=db_config.DB_PASSWORD,
|
||||
database=db_config.DB_NAME,
|
||||
)
|
||||
|
||||
|
||||
def get_latest_import_id(cursor, study):
|
||||
cursor.execute(
|
||||
"SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='drugs'",
|
||||
(study,),
|
||||
)
|
||||
row = cursor.fetchone()
|
||||
mid = row["mid"]
|
||||
if mid is None:
|
||||
        raise RuntimeError(f"No data in MySQL for study {study}")
|
||||
return mid
|
||||
|
||||
|
||||
# ── Loading data from MySQL ───────────────────────────────────────────────────
|
||||
|
||||
def load_inventory(cursor, study, import_id):
|
||||
"""
|
||||
    Return a DataFrame of inventory rows left-joined with destruction data.
    Columns are already renamed for the downstream functions.
|
||||
"""
|
||||
sql = """
|
||||
SELECT
|
||||
i.site AS Site,
|
||||
i.medication_id AS `Med ID`,
|
||||
i.packaged_lot_no AS `Lot No.`,
|
||||
i.original_expiration_date AS `Orig Exp Date`,
|
||||
i.expiration_date AS `Exp Date`,
|
||||
i.received_date AS `Rcv Date`,
|
||||
i.receipt_user AS `Rcpt User`,
|
||||
i.subject_identifier AS `Subject ID`,
|
||||
i.quantity_assigned AS `Qty Asgn`,
|
||||
i.irt_transaction AS `IRT Tx`,
|
||||
i.date_assigned AS `Date Asgn`,
|
||||
i.assignment_user AS `Asgn User`,
|
||||
i.dispensation_status AS `Disp Status`,
|
||||
i.dispensing_date AS `Disp Date`,
|
||||
i.quantity_dispensed AS `Qty Disp`,
|
||||
i.dispensing_user AS `Disp User`,
|
||||
i.quantity_returned AS `Qty Ret`,
|
||||
i.date_returned AS `Date Ret`,
|
||||
i.return_user AS `Ret User`,
|
||||
d.destruction_date AS Destroyed,
|
||||
d.basket_id AS `Basket No.`
|
||||
FROM iwrs_inventory i
|
||||
LEFT JOIN (
|
||||
SELECT medication_id,
|
||||
ANY_VALUE(basket_id) AS basket_id,
|
||||
ANY_VALUE(destruction_date) AS destruction_date
|
||||
FROM iwrs_destruction
|
||||
WHERE study = %s
|
||||
GROUP BY medication_id
|
||||
) d ON d.medication_id = i.medication_id
|
||||
WHERE i.import_id = %s
|
||||
AND i.study = %s
|
||||
ORDER BY i.site, i.received_date, i.medication_id
|
||||
"""
|
||||
cursor.execute(sql, (study, import_id, study))
|
||||
rows = cursor.fetchall()
|
||||
df = pd.DataFrame(rows)
|
||||
for col in DATE_COLUMNS:
|
||||
if col in df.columns:
|
||||
df[col] = pd.to_datetime(df[col], errors="coerce")
|
||||
    print(f" Inventory: {len(df)} kits")
|
||||
return df
|
||||
|
||||
|
||||
def load_shipments(cursor, study, import_id):
|
||||
"""
|
||||
    Return a DataFrame of shipments joined with their items.
|
||||
"""
|
||||
sql = """
|
||||
SELECT
|
||||
s.shipment_id AS `Shipment ID`,
|
||||
s.status AS `IRT Shipment Status`,
|
||||
s.type AS Type,
|
||||
s.ship_from AS `Shipment From`,
|
||||
s.ship_to_site AS `Ship To:`,
|
||||
s.request_date AS `Request Date`,
|
||||
s.received_date AS `Received Date`,
|
||||
s.received_by AS `Received by`,
|
||||
s.expected_arrival AS `Expected Arrival`,
|
||||
i.investigator AS Investigator,
|
||||
i.medication_description AS `Medication Description`,
|
||||
i.medication_id AS `Medication ID`,
|
||||
i.packaged_lot_no AS `Packaged Lot number`,
|
||||
i.expiration_date AS `Expiration Date`,
|
||||
i.item_status AS Status
|
||||
FROM iwrs_shipments s
|
||||
JOIN iwrs_shipment_items i
|
||||
ON i.study = s.study
|
||||
AND i.shipment_id = s.shipment_id
|
||||
AND i.import_id = %s
|
||||
WHERE s.import_id = %s
|
||||
AND s.study = %s
|
||||
ORDER BY s.ship_to_site, s.shipment_id, i.medication_id
|
||||
"""
|
||||
cursor.execute(sql, (import_id, import_id, study))
|
||||
rows = cursor.fetchall()
|
||||
df = pd.DataFrame(rows)
|
||||
for col in ("Request Date", "Received Date", "Expiration Date", "Expected Arrival"):
|
||||
if col in df.columns:
|
||||
df[col] = pd.to_datetime(df[col], errors="coerce")
|
||||
    print(f" Shipments: {df['Shipment ID'].nunique() if len(df) else 0} shipments, {len(df)} kits")
|
||||
return df
|
||||
|
||||
|
||||
# ── Derived sheets ────────────────────────────────────────────────────────────
|
||||
|
||||
def build_site_summary(shipments_df):
|
||||
STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
|
||||
pivot = shipments_df.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
|
||||
for s in STATUS_COLS:
|
||||
if s not in pivot.columns:
|
||||
pivot[s] = 0
|
||||
pivot = (
|
||||
pivot[STATUS_COLS]
|
||||
.reset_index()
|
||||
.rename(columns={"Ship To:": "Site", "Returned by Subject": "Returned"})
|
||||
.sort_values("Site")
|
||||
.reset_index(drop=True)
|
||||
)
|
||||
pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)
|
||||
    print(f" Site Summary: {len(pivot)} sites")
|
||||
return pivot
|
||||
|
||||
|
||||
def build_expired(df):
|
||||
today = date.today()
|
||||
mask = (
|
||||
df["Basket No."].isna() &
|
||||
df["Subject ID"].isna() &
|
||||
(df["Exp Date"] < pd.Timestamp(today))
|
||||
)
|
||||
filtered = df[mask].copy().reset_index(drop=True)
|
||||
sheet_name = f"Expired as of {today.strftime('%d-%b-%Y')}"
|
||||
print(f" Expired: {len(filtered)}")
|
||||
return filtered, sheet_name
|
||||
|
||||
|
||||
def build_assigned_not_dispensed(df):
|
||||
mask = df["Subject ID"].notna() & df["Disp Date"].isna()
|
||||
filtered = df[mask].copy().reset_index(drop=True)
|
||||
print(f" Assigned not dispensed: {len(filtered)}")
|
||||
return filtered
|
||||
|
||||
|
||||
def build_not_returned(df):
|
||||
no_ret = df[
|
||||
df["Date Ret"].isna() &
|
||||
df["Subject ID"].notna() &
|
||||
(df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
|
||||
].copy()
|
||||
max_asgn = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
|
||||
no_ret = no_ret.join(max_asgn, on="Subject ID")
|
||||
filtered = no_ret[no_ret["Date Asgn"] < no_ret["Max Visit Date"]].copy()
|
||||
filtered = filtered.drop(columns=["Qty Ret", "Date Ret", "Ret User", "Destroyed", "Basket No."])
|
||||
filtered = filtered.reset_index(drop=True)
|
||||
print(f" Not returned: {len(filtered)}")
|
||||
return filtered
|
||||
|
||||
|
||||
def build_kits_for_destruction(df):
|
||||
mask = (
|
||||
df["Basket No."].isna() &
|
||||
(df["Date Ret"].notna() | (df["Disp Status"].fillna("").str.upper() == "NOT DISPENSED"))
|
||||
)
|
||||
filtered = (
|
||||
df[mask]
|
||||
.copy()
|
||||
.sort_values(["Site", "Date Ret"], ascending=[True, True])
|
||||
.drop(columns=["Destroyed", "Basket No."])
|
||||
.reset_index(drop=True)
|
||||
)
|
||||
print(f" Kits for destruction: {len(filtered)}")
|
||||
return filtered
|
||||
|
||||
|
||||
# ── Formatting ────────────────────────────────────────────────────────────────
|
||||
|
||||
def format_sheet(ws, header_color, highlight_col=None, highlight_color=None):
|
||||
thin = Side(style="thin", color="000000")
|
||||
border = Border(left=thin, right=thin, top=thin, bottom=thin)
|
||||
header_fill = PatternFill("solid", start_color=header_color)
|
||||
header_font = Font(bold=True, color="FFFFFF", name="Arial", size=10)
|
||||
row_font = Font(name="Arial", size=10)
|
||||
hi_fill = PatternFill("solid", start_color=highlight_color) if highlight_color else None
|
||||
|
||||
headers = [cell.value for cell in ws[1]]
|
||||
|
||||
for cell in ws[1]:
|
||||
cell.fill = header_fill
|
||||
cell.font = header_font
|
||||
cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=False)
|
||||
cell.border = border
|
||||
|
||||
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
|
||||
for cell in row:
|
||||
col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
|
||||
cell.font = row_font
|
||||
cell.border = border
|
||||
cell.alignment = Alignment(horizontal="center")
|
||||
if col_name in DATE_COLUMNS:
|
||||
cell.number_format = "DD-MMM-YYYY"
|
||||
if hi_fill and col_name == highlight_col:
|
||||
cell.fill = hi_fill
|
||||
|
||||
for cell in ws[1]:
|
||||
width = COLUMN_WIDTHS.get(cell.value, 14)
|
||||
ws.column_dimensions[get_column_letter(cell.column)].width = width
|
||||
|
||||
ws.auto_filter.ref = ws.dimensions
|
||||
ws.freeze_panes = "A2"
|
||||
|
||||
|
||||
def format_shipment_sheet(ws, header_color_ship, header_color_detail, n_ship_cols):
|
||||
thin = Side(style="thin", color="000000")
|
||||
border = Border(left=thin, right=thin, top=thin, bottom=thin)
|
||||
hfont = Font(bold=True, color="FFFFFF", name="Arial", size=10)
|
||||
dfont = Font(name="Arial", size=10)
|
||||
fill_ship = PatternFill("solid", start_color=header_color_ship)
|
||||
fill_detail = PatternFill("solid", start_color=header_color_detail)
|
||||
|
||||
for cell in ws[1]:
|
||||
cell.fill = fill_ship if cell.column <= n_ship_cols else fill_detail
|
||||
cell.font = hfont
|
||||
cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
|
||||
cell.border = border
|
||||
ws.column_dimensions[get_column_letter(cell.column)].width = min(
|
||||
len(str(cell.value or "")) + 4, 35
|
||||
)
|
||||
ws.row_dimensions[1].height = 30
|
||||
|
||||
for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
|
||||
for cell in row:
|
||||
cell.font = dfont
|
||||
cell.border = border
|
||||
cell.alignment = Alignment(horizontal="center", vertical="center")
|
||||
            if isinstance(cell.value, date):  # datetime and pandas Timestamp are subclasses of date
|
||||
cell.number_format = "DD-MMM-YYYY"
|
||||
|
||||
ws.auto_filter.ref = ws.dimensions
|
||||
ws.freeze_panes = "A2"
|
||||
|
||||
|
||||
# ── Main ──────────────────────────────────────────────────────────────────────
|
||||
|
||||
def main():
|
||||
OUTPUT_DIR.mkdir(exist_ok=True)
|
||||
|
||||
    print(f"\nLoading data from MySQL for {STUDY}...")
|
||||
conn = get_conn()
|
||||
cursor = conn.cursor(dictionary=True)
|
||||
import_id = get_latest_import_id(cursor, STUDY)
|
||||
print(f" import_id = {import_id}")
|
||||
|
||||
df = load_inventory(cursor, STUDY, import_id)
|
||||
shipments_df = load_shipments(cursor, STUDY, import_id)
|
||||
|
||||
cursor.close()
|
||||
conn.close()
|
||||
|
||||
expired_df, expired_sheet = build_expired(df)
|
||||
assigned_df = build_assigned_not_dispensed(df)
|
||||
not_returned_df = build_not_returned(df)
|
||||
destruction_df = build_kits_for_destruction(df)
|
||||
site_summary_df = build_site_summary(shipments_df)
|
||||
|
||||
with pd.ExcelWriter(OUTPUT_FILE, engine="openpyxl") as writer:
|
||||
df.to_excel( writer, index=False, sheet_name="CountryMedicationOverview")
|
||||
expired_df.to_excel( writer, index=False, sheet_name=expired_sheet)
|
||||
assigned_df.to_excel( writer, index=False, sheet_name="Assigned not dispensed")
|
||||
not_returned_df.to_excel( writer, index=False, sheet_name="Not returned")
|
||||
destruction_df.to_excel( writer, index=False, sheet_name="Kits for destruction")
|
||||
shipments_df.to_excel( writer, index=False, sheet_name="Shipments")
|
||||
site_summary_df.to_excel( writer, index=False, sheet_name="Site Summary")
|
||||
|
||||
wb = load_workbook(OUTPUT_FILE)
|
||||
|
||||
ws_main = wb["CountryMedicationOverview"]
|
||||
format_sheet(ws_main, header_color="1F4E79")
|
||||
new_col_fill = PatternFill("solid", start_color="E2EFDA")
|
||||
headers_main = [c.value for c in ws_main[1]]
|
||||
for row in ws_main.iter_rows(min_row=2, max_row=ws_main.max_row):
|
||||
for cell in row:
|
||||
col_name = headers_main[cell.column - 1] if cell.column <= len(headers_main) else None
|
||||
if col_name in ("Destroyed", "Basket No."):
|
||||
cell.fill = new_col_fill
|
||||
|
||||
format_sheet(wb[expired_sheet], header_color="C00000", highlight_col="Exp Date", highlight_color="FFE0E0")
|
||||
format_sheet(wb["Assigned not dispensed"], header_color="833C00", highlight_col="Subject ID", highlight_color="FFF2CC")
|
||||
format_sheet(wb["Not returned"], header_color="375623", highlight_col="Max Visit Date", highlight_color="E2EFDA")
|
||||
format_sheet(wb["Kits for destruction"], header_color="595959")
|
||||
format_shipment_sheet(wb["Shipments"], "1F4E79", "375623", N_SHIP_COLS)
|
||||
format_sheet(wb["Site Summary"], header_color="1F4E79")
|
||||
|
||||
wb.save(OUTPUT_FILE)
|
||||
print(f"\nUloženo: {OUTPUT_FILE} ({len(df)} řádků, sheety: {wb.sheetnames})")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
@@ -0,0 +1,205 @@
import sys
import os
import mysql.connector
import openpyxl
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter
from datetime import date
import pandas as pd

# db_config.py lives in the parent directory (Drugs/)
sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
import db_config

STUDY = "77242113UCO3001"
OUTPUT_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), "output")

os.makedirs(OUTPUT_DIR, exist_ok=True)


def get_conn():
    return mysql.connector.connect(
        host=db_config.DB_HOST, port=db_config.DB_PORT,
        user=db_config.DB_USER, password=db_config.DB_PASSWORD,
        database=db_config.DB_NAME,
    )


def load_data(study):
    conn = get_conn()
    cursor = conn.cursor(dictionary=True)
    try:
        # latest import_id for the given study
        cursor.execute(
            "SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='drugs'",
            (study,),
        )
        row = cursor.fetchone()
        import_id = row["mid"]
        if import_id is None:
            raise RuntimeError(f"No data in MySQL for study {study}")
        print(f" import_id = {import_id}")

        sql = """
            SELECT
                s.shipment_id,
                s.status AS irt_shipment_status,
                s.type,
                s.ship_from AS shipment_from,
                s.ship_to_site AS ship_to,
                s.request_date,
                s.received_date,
                s.received_by,
                s.expected_arrival,
                i.investigator,
                i.medication_description,
                i.medication_id,
                i.packaged_lot_no,
                i.expiration_date,
                i.item_status AS status
            FROM iwrs_shipments s
            JOIN iwrs_shipment_items i
                ON i.study = s.study
                AND i.shipment_id = s.shipment_id
                AND i.import_id = %s
            WHERE s.import_id = %s
                AND s.study = %s
            ORDER BY s.ship_to_site, s.shipment_id, i.medication_id
        """
        cursor.execute(sql, (import_id, import_id, study))
        rows = cursor.fetchall()
    finally:
        # release the cursor and connection even when a query fails
        cursor.close()
        conn.close()
    print(f" Rows loaded: {len(rows)}")
    return rows


# shipment columns (blue header) / detail columns (green header)
SHIP_COLS = [
    ("shipment_id", "Shipment ID"),
    ("irt_shipment_status", "IRT Shipment Status"),
    ("type", "Type"),
    ("shipment_from", "Shipment From"),
    ("ship_to", "Ship To:"),
    ("request_date", "Request Date"),
    ("received_date", "Received Date"),
    ("received_by", "Received by"),
    ("expected_arrival", "Expected Arrival"),
]

DETAIL_COLS = [
    ("investigator", "Investigator"),
    ("medication_description", "Medication Description"),
    ("medication_id", "Medication ID"),
    ("packaged_lot_no", "Packaged Lot number"),
    ("expiration_date", "Expiration Date"),
    ("status", "Status"),
]

ALL_COLS = SHIP_COLS + DETAIL_COLS
N_SHIP_COLS = len(SHIP_COLS)

HEADER_FILL_SHIP = PatternFill("solid", fgColor="1F4E79")
HEADER_FILL_DETAIL = PatternFill("solid", fgColor="375623")
HEADER_FONT = Font(name="Arial", bold=True, color="FFFFFF", size=10)
DATA_FONT = Font(name="Arial", size=10)
THIN_BORDER = Border(
    left=Side(style="thin", color="BFBFBF"),
    right=Side(style="thin", color="BFBFBF"),
    bottom=Side(style="thin", color="BFBFBF"),
)


def write_shipments_sheet(wb, rows):
    ws = wb.active
    ws.title = "Shipments"

    # header row
    for ci, (_, label) in enumerate(ALL_COLS, 1):
        cell = ws.cell(row=1, column=ci, value=label)
        cell.font = HEADER_FONT
        cell.fill = HEADER_FILL_SHIP if ci <= N_SHIP_COLS else HEADER_FILL_DETAIL
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        cell.border = THIN_BORDER
    ws.row_dimensions[1].height = 30

    # data rows
    for ri, row in enumerate(rows, 2):
        for ci, (key, _) in enumerate(ALL_COLS, 1):
            val = row[key]
            cell = ws.cell(row=ri, column=ci, value=val)
            cell.font = DATA_FONT
            cell.border = THIN_BORDER
            cell.alignment = Alignment(horizontal="center", vertical="center")
            if isinstance(val, date):
                cell.number_format = "DD-MMM-YYYY"

    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"

    # column widths: longest value plus padding, capped at 35 characters
    for ci, (key, label) in enumerate(ALL_COLS, 1):
        vals = [label] + [str(r[key]) for r in rows if r[key] is not None]
        ws.column_dimensions[get_column_letter(ci)].width = min(
            max((len(v) for v in vals), default=10) + 2, 35
        )

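# Sketch, not part of the original commit: the width rule used at the end of
# write_shipments_sheet, pulled out as a standalone function so it can be checked
# without a workbook. The name column_width and its padding / max_width parameters
# are assumptions made for this illustration.

```python
def column_width(header, values, padding=2, max_width=35):
    """Return a column width: the longest text plus padding, capped at max_width."""
    # None values are skipped, exactly like the list comprehension in the script
    texts = [str(header)] + [str(v) for v in values if v is not None]
    return min(max(len(t) for t in texts) + padding, max_width)


if __name__ == "__main__":
    print(column_width("Shipment ID", [1, 22, None]))
    print(column_width("Status", ["x" * 50]))
```

# The header is always part of the candidate list, so the function never sees an
# empty sequence even when a column holds no data.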
def write_summary_sheet(wb, rows):
    STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]

    # count kits per site and status; statuses outside STATUS_COLS are dropped
    df = pd.DataFrame(rows)
    pivot = df.groupby("ship_to")["status"].value_counts().unstack(fill_value=0)
    for s in STATUS_COLS:
        if s not in pivot.columns:
            pivot[s] = 0
    pivot = (
        pivot[STATUS_COLS]
        .reset_index()
        .rename(columns={"ship_to": "Site", "Returned by Subject": "Returned"})
        .sort_values("Site")
        .reset_index(drop=True)
    )
    pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)

    ws = wb.create_sheet("Site Summary")
    s_cols = ["Site", "Available", "Assigned", "Dispensed", "Returned", "Total"]

    for ci, col in enumerate(s_cols, 1):
        cell = ws.cell(row=1, column=ci, value=col)
        cell.font = HEADER_FONT
        cell.fill = PatternFill("solid", fgColor="1F4E79")
        cell.alignment = Alignment(horizontal="center", vertical="center")
        cell.border = THIN_BORDER
    ws.row_dimensions[1].height = 25

    for ri, (_, row) in enumerate(pivot.iterrows(), 2):
        for ci, col in enumerate(s_cols, 1):
            cell = ws.cell(row=ri, column=ci, value=row[col])
            cell.font = DATA_FONT
            cell.border = THIN_BORDER
            cell.alignment = Alignment(horizontal="center", vertical="center")

    for ci, col in enumerate(s_cols, 1):
        vals = [col] + [str(v) for v in pivot[col]]
        ws.column_dimensions[get_column_letter(ci)].width = min(
            max(len(v) for v in vals) + 4, 35
        )

    ws.freeze_panes = "A2"


def build_report():
    print(f"\nLoading data from MySQL for {STUDY}...")
    rows = load_data(STUDY)

    wb = openpyxl.Workbook()
    write_shipments_sheet(wb, rows)
    write_summary_sheet(wb, rows)

    outfile = os.path.join(OUTPUT_DIR, f"{date.today()} {STUDY} CZ Shipments.xlsx")
    wb.save(outfile)
    print(f"\nSaved -> {outfile}")


# guard the entry point so importing this module does not trigger a report run
if __name__ == "__main__":
    build_report()
@@ -0,0 +1,393 @@
import sys
import os
import mysql.connector
import pandas as pd
from datetime import date
from pathlib import Path
from openpyxl import load_workbook
from openpyxl.styles import Font, PatternFill, Alignment, Border, Side
from openpyxl.utils import get_column_letter

sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), ".."))
import db_config

STUDIES = [
    ("77242113UCO3001", "UCO"),
    ("42847922MDD3003", "MDD"),
]

BASE_DIR = Path(os.path.dirname(os.path.abspath(__file__)))
OUTPUT_DIR = BASE_DIR / "output"

DATE_COLUMNS = {
    "Orig Exp Date", "Exp Date", "Rcv Date",
    "Date Asgn", "Disp Date", "Date Ret", "Destroyed", "Max Visit Date",
}

COLUMN_WIDTHS = {
    "Site": 14,
    "Med ID": 10,
    "Lot No.": 12,
    "Orig Exp Date": 16,
    "Exp Date": 14,
    "Rcv Date": 14,
    "Rcpt User": 22,
    "Subject ID": 14,
    "Qty Asgn": 9,
    "IRT Tx": 8,
    "Date Asgn": 14,
    "Asgn User": 20,
    "Disp Status": 16,
    "Disp Date": 14,
    "Qty Disp": 9,
    "Disp User": 20,
    "Qty Ret": 10,
    "Date Ret": 14,
    "Ret User": 18,
    "Destroyed": 14,
    "Basket No.": 12,
    "Max Visit Date": 16,
}

N_SHIP_COLS = 9  # number of shipment columns (blue header in the Shipments sheet)


# ── DB ────────────────────────────────────────────────────────────────────────

def get_conn():
    return mysql.connector.connect(
        host=db_config.DB_HOST, port=db_config.DB_PORT,
        user=db_config.DB_USER, password=db_config.DB_PASSWORD,
        database=db_config.DB_NAME,
    )


def get_latest_import_id(cursor, study):
    # MAX() over zero rows still returns one row, with NULL in it
    cursor.execute(
        "SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='drugs'",
        (study,),
    )
    row = cursor.fetchone()
    mid = row["mid"]
    if mid is None:
        raise RuntimeError(f"No data in MySQL for study {study}")
    return mid

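# Sketch, not part of the original commit: the "latest import" lookup from
# get_latest_import_id, reproduced against an in-memory SQLite database so it runs
# without a MySQL server. Assumptions: the table is reduced to three columns, and
# SQLite uses "?" placeholders where mysql.connector uses "%s". The function name
# latest_import_id and the sample rows are made up for the illustration.

```python
import sqlite3


def latest_import_id(conn, study):
    """Return the highest import_id of a 'drugs' import for the study."""
    row = conn.execute(
        "SELECT MAX(import_id) FROM iwrs_import "
        "WHERE study = ? AND report_type = 'drugs'",
        (study,),
    ).fetchone()
    # an aggregate without matching rows yields a single row holding NULL
    if row[0] is None:
        raise RuntimeError(f"No data for study {study}")
    return row[0]


def make_demo_db():
    """Build a throwaway database with a few hypothetical import rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE iwrs_import (import_id INTEGER, study TEXT, report_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO iwrs_import VALUES (?, ?, ?)",
        [(1, "STUDY_A", "drugs"), (2, "STUDY_A", "drugs"), (3, "STUDY_A", "other")],
    )
    return conn


if __name__ == "__main__":
    print(latest_import_id(make_demo_db(), "STUDY_A"))
```

# Checking for None rather than for a missing row matters here: fetchone() never
# returns None for an aggregate query, so the NULL test is what detects "no import".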
# ── Data loading ──────────────────────────────────────────────────────────────

def load_inventory(cursor, study, import_id):
    # destruction rows are not versioned by import_id, so they are collapsed to
    # one row per medication_id before the join
    sql = """
        SELECT
            i.site AS Site,
            i.medication_id AS `Med ID`,
            i.packaged_lot_no AS `Lot No.`,
            i.original_expiration_date AS `Orig Exp Date`,
            i.expiration_date AS `Exp Date`,
            i.received_date AS `Rcv Date`,
            i.receipt_user AS `Rcpt User`,
            i.subject_identifier AS `Subject ID`,
            i.quantity_assigned AS `Qty Asgn`,
            i.irt_transaction AS `IRT Tx`,
            i.date_assigned AS `Date Asgn`,
            i.assignment_user AS `Asgn User`,
            i.dispensation_status AS `Disp Status`,
            i.dispensing_date AS `Disp Date`,
            i.quantity_dispensed AS `Qty Disp`,
            i.dispensing_user AS `Disp User`,
            i.quantity_returned AS `Qty Ret`,
            i.date_returned AS `Date Ret`,
            i.return_user AS `Ret User`,
            d.destruction_date AS Destroyed,
            d.basket_id AS `Basket No.`
        FROM iwrs_inventory i
        LEFT JOIN (
            SELECT medication_id,
                   ANY_VALUE(basket_id) AS basket_id,
                   ANY_VALUE(destruction_date) AS destruction_date
            FROM iwrs_destruction
            WHERE study = %s
            GROUP BY medication_id
        ) d ON d.medication_id = i.medication_id
        WHERE i.import_id = %s
            AND i.study = %s
        ORDER BY i.site, i.received_date, i.medication_id
    """
    cursor.execute(sql, (study, import_id, study))
    rows = cursor.fetchall()
    df = pd.DataFrame(rows)
    for col in DATE_COLUMNS:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    print(f" Inventory: {len(df)} kits")
    return df


def load_shipments(cursor, study, import_id):
    sql = """
        SELECT
            s.shipment_id AS `Shipment ID`,
            s.status AS `IRT Shipment Status`,
            s.type AS Type,
            s.ship_from AS `Shipment From`,
            s.ship_to_site AS `Ship To:`,
            s.request_date AS `Request Date`,
            s.received_date AS `Received Date`,
            s.received_by AS `Received by`,
            s.expected_arrival AS `Expected Arrival`,
            i.investigator AS Investigator,
            i.medication_description AS `Medication Description`,
            i.medication_id AS `Medication ID`,
            i.packaged_lot_no AS `Packaged Lot number`,
            i.expiration_date AS `Expiration Date`,
            i.item_status AS Status
        FROM iwrs_shipments s
        JOIN iwrs_shipment_items i
            ON i.study = s.study
            AND i.shipment_id = s.shipment_id
            AND i.import_id = %s
        WHERE s.import_id = %s
            AND s.study = %s
        ORDER BY s.ship_to_site, s.shipment_id, i.medication_id
    """
    cursor.execute(sql, (import_id, import_id, study))
    rows = cursor.fetchall()
    df = pd.DataFrame(rows)
    for col in ("Request Date", "Received Date", "Expiration Date", "Expected Arrival"):
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], errors="coerce")
    n_ship = df["Shipment ID"].nunique() if len(df) else 0
    print(f" Shipments: {n_ship} shipments, {len(df)} kits")
    return df

# ── Derived sheets ────────────────────────────────────────────────────────────

def build_site_summary(shipments_df):
    STATUS_COLS = ["Available", "Assigned", "Dispensed", "Returned by Subject"]
    # one row per site, one column per status; statuses outside STATUS_COLS are dropped
    pivot = shipments_df.groupby("Ship To:")["Status"].value_counts().unstack(fill_value=0)
    for s in STATUS_COLS:
        if s not in pivot.columns:
            pivot[s] = 0
    pivot = (
        pivot[STATUS_COLS]
        .reset_index()
        .rename(columns={"Ship To:": "Site", "Returned by Subject": "Returned"})
        .sort_values("Site")
        .reset_index(drop=True)
    )
    pivot["Total"] = pivot[["Available", "Assigned", "Dispensed", "Returned"]].sum(axis=1)
    print(f" Site Summary: {len(pivot)} sites")
    return pivot

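# Sketch, not part of the original commit: what build_site_summary computes, written
# with the standard library only so the counting rule is easy to follow. The input
# shape (an iterable of (site, status) pairs) and the name summarize_by_site are
# assumptions for this illustration; the report itself works on a DataFrame.

```python
from collections import Counter, defaultdict

# statuses that get their own column; anything else is ignored, as in the pivot
STATUSES = ["Available", "Assigned", "Dispensed", "Returned by Subject"]


def summarize_by_site(items):
    """Count kits per site and status and add a Total over the tracked statuses."""
    counts = defaultdict(Counter)
    for site, status in items:
        counts[site][status] += 1

    summary = []
    for site in sorted(counts):
        row = {"Site": site}
        for status in STATUSES:
            # a Counter returns 0 for statuses a site never had
            label = "Returned" if status == "Returned by Subject" else status
            row[label] = counts[site][status]
        row["Total"] = sum(counts[site][status] for status in STATUSES)
        summary.append(row)
    return summary


if __name__ == "__main__":
    demo = [("A", "Available"), ("A", "Assigned"), ("B", "Dispensed")]
    for line in summarize_by_site(demo):
        print(line)
```

# Note that Total is the sum of the four tracked statuses only, so a kit with any
# other status does not appear anywhere in the summary.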
def build_expired(df):
    # expired kits still on site: not in a destruction basket, never assigned
    # to a subject, and past their expiration date
    today = date.today()
    mask = (
        df["Basket No."].isna() &
        df["Subject ID"].isna() &
        (df["Exp Date"] < pd.Timestamp(today))
    )
    filtered = df[mask].copy().reset_index(drop=True)
    print(f" Expired: {len(filtered)}")
    return filtered

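# Sketch, not part of the original commit: the three conditions of build_expired as
# a plain predicate on one kit. The dictionary keys (basket_no, subject_id, exp_date)
# and the function name are hypothetical; they stand in for the report columns
# "Basket No.", "Subject ID" and "Exp Date".

```python
from datetime import date


def is_expired_on_site(kit, today):
    """True when a kit is unassigned, not in a basket, and past its expiry date."""
    exp_date = kit.get("exp_date")
    return (
        kit.get("basket_no") is None
        and kit.get("subject_id") is None
        # a missing expiry date never counts as expired
        and exp_date is not None
        and exp_date < today
    )


if __name__ == "__main__":
    kit = {"basket_no": None, "subject_id": None, "exp_date": date(2026, 1, 31)}
    print(is_expired_on_site(kit, date(2026, 2, 1)))
```

# The comparison is strict, so a kit whose expiry date equals today is not yet
# reported, which matches the "<" used in the DataFrame mask.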
def build_assigned_not_dispensed(df):
    # kits assigned to a subject that have no dispensing date yet
    mask = df["Subject ID"].notna() & df["Disp Date"].isna()
    filtered = df[mask].copy().reset_index(drop=True)
    print(f" Assigned not dispensed: {len(filtered)}")
    return filtered


def build_not_returned(df):
    # candidates: dispensed to a subject and not yet returned
    no_ret = df[
        df["Date Ret"].isna() &
        df["Subject ID"].notna() &
        (df["Disp Status"].fillna("").str.upper() != "NOT DISPENSED")
    ].copy()
    # the subject's latest assignment date stands in for the most recent visit
    max_asgn = df.groupby("Subject ID")["Date Asgn"].max().rename("Max Visit Date")
    no_ret = no_ret.join(max_asgn, on="Subject ID")
    # keep only kits from earlier visits; the latest kit may still be in use
    filtered = no_ret[no_ret["Date Asgn"] < no_ret["Max Visit Date"]].copy()
    filtered = filtered.drop(columns=["Qty Ret", "Date Ret", "Ret User", "Destroyed", "Basket No."])
    filtered = filtered.reset_index(drop=True)
    print(f" Not returned: {len(filtered)}")
    return filtered

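# Sketch, not part of the original commit: the rule behind build_not_returned in
# plain Python. A kit is flagged when it was dispensed, has no return date, and the
# same subject was assigned a kit on a later date. The dictionary keys and the
# function name not_returned are hypothetical stand-ins for the report columns.

```python
from datetime import date


def not_returned(kits):
    """Return kits still with a subject who has since been assigned newer kits."""
    # latest assignment date per subject (the "Max Visit Date" of the report)
    latest = {}
    for kit in kits:
        subject, assigned = kit["subject_id"], kit["date_assigned"]
        if subject is None or assigned is None:
            continue
        if subject not in latest or assigned > latest[subject]:
            latest[subject] = assigned

    flagged = []
    for kit in kits:
        subject, assigned = kit["subject_id"], kit["date_assigned"]
        if subject is None or kit["date_returned"] is not None:
            continue
        if (kit["disp_status"] or "").upper() == "NOT DISPENSED":
            continue
        # strictly earlier than the latest assignment: the newest kit is skipped
        if assigned is not None and assigned < latest[subject]:
            flagged.append({**kit, "max_visit_date": latest[subject]})
    return flagged


if __name__ == "__main__":
    demo = [
        {"med_id": 1, "subject_id": "S1", "date_assigned": date(2026, 1, 1),
         "date_returned": None, "disp_status": "Dispensed"},
        {"med_id": 2, "subject_id": "S1", "date_assigned": date(2026, 2, 1),
         "date_returned": None, "disp_status": "Dispensed"},
    ]
    print([k["med_id"] for k in not_returned(demo)])
```

# The latest assignment date is computed over all kits of the subject, including
# returned ones, which mirrors the groupby on the full DataFrame.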
def build_kits_for_destruction(df):
    # kits not yet in a basket that were either returned or never dispensed
    mask = (
        df["Basket No."].isna() &
        (df["Date Ret"].notna() | (df["Disp Status"].fillna("").str.upper() == "NOT DISPENSED"))
    )
    filtered = (
        df[mask]
        .copy()
        .sort_values(["Site", "Date Ret"], ascending=[True, True])
        .drop(columns=["Destroyed", "Basket No."])
        .reset_index(drop=True)
    )
    print(f" Kits for destruction: {len(filtered)}")
    return filtered


# ── Formatting ────────────────────────────────────────────────────────────────

def format_sheet(ws, header_color, highlight_col=None, highlight_color=None):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    header_fill = PatternFill("solid", start_color=header_color)
    header_font = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    row_font = Font(name="Arial", size=10)
    hi_fill = PatternFill("solid", start_color=highlight_color) if highlight_color else None

    headers = [cell.value for cell in ws[1]]

    for cell in ws[1]:
        cell.fill = header_fill
        cell.font = header_font
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=False)
        cell.border = border

    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        for cell in row:
            col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
            cell.font = row_font
            cell.border = border
            cell.alignment = Alignment(horizontal="center")
            if col_name in DATE_COLUMNS:
                cell.number_format = "DD-MMM-YYYY"
            if hi_fill and col_name == highlight_col:
                cell.fill = hi_fill

    # fixed widths per known column, 14 for anything else
    for cell in ws[1]:
        width = COLUMN_WIDTHS.get(cell.value, 14)
        ws.column_dimensions[get_column_letter(cell.column)].width = width

    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"


def format_overview_sheet(ws):
    format_sheet(ws, header_color="1F4E79")
    # tint the two columns that come from the destruction table
    new_col_fill = PatternFill("solid", start_color="E2EFDA")
    headers = [c.value for c in ws[1]]
    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        for cell in row:
            col_name = headers[cell.column - 1] if cell.column <= len(headers) else None
            if col_name in ("Destroyed", "Basket No."):
                cell.fill = new_col_fill


def format_shipment_sheet(ws):
    thin = Side(style="thin", color="000000")
    border = Border(left=thin, right=thin, top=thin, bottom=thin)
    hfont = Font(bold=True, color="FFFFFF", name="Arial", size=10)
    dfont = Font(name="Arial", size=10)
    fill_ship = PatternFill("solid", start_color="1F4E79")
    fill_detail = PatternFill("solid", start_color="375623")

    for cell in ws[1]:
        cell.fill = fill_ship if cell.column <= N_SHIP_COLS else fill_detail
        cell.font = hfont
        cell.alignment = Alignment(horizontal="center", vertical="center", wrap_text=True)
        cell.border = border
        ws.column_dimensions[get_column_letter(cell.column)].width = min(
            len(str(cell.value or "")) + 4, 35
        )
    ws.row_dimensions[1].height = 30

    for row in ws.iter_rows(min_row=2, max_row=ws.max_row):
        for cell in row:
            cell.font = dfont
            cell.border = border
            cell.alignment = Alignment(horizontal="center", vertical="center")
            # datetime and pandas Timestamp are both subclasses of date
            if isinstance(cell.value, date):
                cell.number_format = "DD-MMM-YYYY"

    ws.auto_filter.ref = ws.dimensions
    ws.freeze_panes = "A2"

# ── Main ──────────────────────────────────────────────────────────────────────

# sheet name -> formatter key; the order must match the list returned by process_study
SHEETS_DEF = [
    ("CountryMedicationOverview", "overview"),
    ("Expired", "expired"),
    ("Assigned not dispensed", "assigned"),
    ("Not returned", "not_returned"),
    ("Kits for destruction", "destruction"),
    ("Shipments", "shipments"),
    ("Site Summary", "site_summary"),
]

FORMAT_MAP = {
    "overview": format_overview_sheet,
    "expired": lambda ws: format_sheet(ws, "C00000", "Exp Date", "FFE0E0"),
    "assigned": lambda ws: format_sheet(ws, "833C00", "Subject ID", "FFF2CC"),
    "not_returned": lambda ws: format_sheet(ws, "375623", "Max Visit Date", "E2EFDA"),
    "destruction": lambda ws: format_sheet(ws, "595959"),
    "shipments": format_shipment_sheet,
    "site_summary": lambda ws: format_sheet(ws, "1F4E79"),
}


def process_study(cursor, study):
    import_id = get_latest_import_id(cursor, study)
    print(f" import_id = {import_id}")

    df = load_inventory(cursor, study, import_id)
    shipments_df = load_shipments(cursor, study, import_id)

    expired_df = build_expired(df)
    assigned_df = build_assigned_not_dispensed(df)
    not_returned_df = build_not_returned(df)
    destruction_df = build_kits_for_destruction(df)
    site_summ_df = build_site_summary(shipments_df)

    return [
        df, expired_df, assigned_df, not_returned_df,
        destruction_df, shipments_df, site_summ_df,
    ]


def save_study_report(study, data_frames):
    output_file = OUTPUT_DIR / f"{date.today().strftime('%Y-%m-%d')} {study} report.xlsx"

    with pd.ExcelWriter(output_file, engine="openpyxl") as writer:
        for (sheet_name, _), df_sheet in zip(SHEETS_DEF, data_frames):
            df_sheet.to_excel(writer, index=False, sheet_name=sheet_name)

    # reopen the saved workbook to apply per-sheet formatting
    wb = load_workbook(output_file)
    for (sheet_name, fmt_key) in SHEETS_DEF:
        FORMAT_MAP[fmt_key](wb[sheet_name])
    wb.save(output_file)
    print(f" Saved: {output_file}")


def main():
    OUTPUT_DIR.mkdir(exist_ok=True)

    conn = get_conn()
    cursor = conn.cursor(dictionary=True)

    try:
        for study, _ in STUDIES:
            print(f"\n{'='*55}")
            print(f"[{study}]")
            print(f"{'='*55}")
            try:
                data_frames = process_study(cursor, study)
                save_study_report(study, data_frames)
            except Exception as e:
                # one failing study must not stop the remaining ones
                import traceback
                print(f" ERROR: {e}")
                traceback.print_exc()
    finally:
        cursor.close()
        conn.close()
    print("\nDone.")


if __name__ == "__main__":
    main()
@@ -0,0 +1,76 @@
from playwright.sync_api import sync_playwright
import os

# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"

# Credentials are read from the environment rather than committed with the code.
# IWRS_EMAIL and IWRS_PASSWORD are assumed variable names; adjust them to your setup.
EMAIL = os.environ.get("IWRS_EMAIL", "")
PASSWORD = os.environ.get("IWRS_PASSWORD", "")

# STUDY = "42847922MDD3003"
STUDY = "77242113UCO3001"

OUTPUT_DIR = f"xls_ip_destruction_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────

def run(page, study):
    output_dir = f"xls_ip_destruction_{study}"
    os.makedirs(output_dir, exist_ok=True)

    page.goto(f"{BASE_URL}/report/ip_destruction_form")
    page.wait_for_load_state("networkidle", timeout=120000)

    # open the dropdown once to read the list of available baskets
    page.locator('input[placeholder="search"], input[type="text"]').first.click()
    page.wait_for_timeout(1000)
    baskets = [b.strip() for b in page.locator('mat-option').all_inner_texts()
               if b.strip() and b.strip() != "No results found"]
    print(f" Found {len(baskets)} baskets: {baskets}")
    page.keyboard.press("Escape")
    page.wait_for_timeout(500)

    if not baskets:
        print(" No destruction baskets, skipping.")
        return

    for basket in baskets:
        filename = os.path.join(output_dir, f"ip_destruction_basket_{basket}.xlsx")
        if os.path.exists(filename):
            print(f" [{basket}] Skipping, file already exists.")
            continue
        print(f" [{basket}] Downloading...")
        input_field = page.locator('input[placeholder="search"], input[type="text"]').first
        input_field.click()
        input_field.fill(basket)
        page.wait_for_timeout(500)
        page.locator('mat-option').first.dispatch_event('click')
        page.wait_for_load_state("networkidle", timeout=120000)

        with page.expect_download(timeout=120000) as dl:
            page.get_by_role("button", name="Download XLS").click()
        dl.value.save_as(filename)
        print(f" [{basket}] OK")

        page.get_by_role("button", name="Clear").click()
        page.wait_for_load_state("networkidle", timeout=120000)

    print(" Destruction done.")

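# Sketch, not part of the original commit: run() puts the basket label straight
# into a file name. Whether real basket labels can contain characters that are
# invalid in paths is an assumption; if they can, a small helper like the
# hypothetical safe_filename below would keep the download path valid.

```python
import re


def safe_filename(label, replacement="_"):
    """Replace whitespace and characters that are invalid in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', replacement, label.strip())
    # avoid names that consist only of replacement characters
    return cleaned.strip(replacement) or "unnamed"


if __name__ == "__main__":
    print(safe_filename("CZ10001/Basket 3"))
```

# A possible use would be f"ip_destruction_basket_{safe_filename(basket)}.xlsx",
# while the unmodified label is still what gets typed into the search field.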
if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study before running the report download
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,83 @@
from playwright.sync_api import sync_playwright
import os

# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"

# Credentials are read from the environment rather than committed with the code.
# IWRS_EMAIL and IWRS_PASSWORD are assumed variable names; adjust them to your setup.
EMAIL = os.environ.get("IWRS_EMAIL", "")
PASSWORD = os.environ.get("IWRS_PASSWORD", "")

# STUDY = "42847922MDD3003"
STUDY = "77242113UCO3001"

SITES = {
    "42847922MDD3003": [
        "S10-CZ10002",
        "S10-CZ10004",
        "S10-CZ10005",
        "S10-CZ10008",
        "S10-CZ10011",
        "S10-CZ10012",
    ],
    "77242113UCO3001": [
        "DD5-CZ10001",
        "DD5-CZ10003",
        "DD5-CZ10006",
        "DD5-CZ10009",
        "DD5-CZ10010",
        "DD5-CZ10012",
        "DD5-CZ10013",
        "DD5-CZ10015",
        "DD5-CZ10016",
        "DD5-CZ10020",
        "DD5-CZ10021",
        "DD5-CZ10022",
    ],
}

OUTPUT_DIR = f"xls_reports_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────

def run(page, study):
    output_dir = f"xls_reports_{study}"
    os.makedirs(output_dir, exist_ok=True)

    page.goto(f"{BASE_URL}/report/onsite_inventory_detail")
    page.wait_for_load_state("networkidle", timeout=120000)

    for site_id in SITES[study]:
        print(f" [{site_id}] Downloading...")
        page.locator('input[placeholder="search"], input[type="text"]').first.click()
        page.get_by_role("option", name=site_id).click()
        page.wait_for_load_state("networkidle", timeout=120000)

        with page.expect_download(timeout=120000) as dl:
            page.get_by_role("button", name="Download XLS").click()

        dl.value.save_as(os.path.join(output_dir, f"onsite_inventory_detail_{site_id}.xlsx"))
        print(f" [{site_id}] OK")

        page.get_by_role("button", name="Clear").click()
        page.wait_for_load_state("networkidle", timeout=120000)

    print(" Inventory done.")


if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study before running the report download
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,95 @@
from playwright.sync_api import sync_playwright
import os
import pandas as pd

# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"

# Credentials are read from the environment rather than committed with the code.
# IWRS_EMAIL and IWRS_PASSWORD are assumed variable names; adjust them to your setup.
EMAIL = os.environ.get("IWRS_EMAIL", "")
PASSWORD = os.environ.get("IWRS_PASSWORD", "")

STUDY = "42847922MDD3003"
# STUDY = "77242113UCO3001"

OUTPUT_DIR = f"xls_shipment_details_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────

def get_cz_shipment_ids(study):
    # returns None when the shipments report has not been downloaded yet,
    # so the caller can fall back to the dropdown
    path = f"xls_shipments_{study}/shipments_report_{study}.xlsx"
    if not os.path.exists(path):
        return None
    # the column names sit on the sixth row of the exported sheet
    df = pd.read_excel(path, header=5)
    df.columns = df.columns.str.strip()
    df = df.dropna(how="all")
    df["Shipment ID"] = df["Shipment ID"].astype(str).str.strip()
    cz = df[df["Location"].str.contains("Czech", na=False, case=False)]
    return cz["Shipment ID"].tolist()

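# Sketch, not part of the original commit: the filter applied by get_cz_shipment_ids,
# shown on plain dictionaries instead of an Excel sheet. The keys "Shipment ID" and
# "Location" come from the code above; the function name cz_shipment_ids and the
# sample rows in the demo are made up for the illustration.

```python
def cz_shipment_ids(rows, needle="czech"):
    """Return shipment IDs whose Location mentions the needle, case-insensitively."""
    ids = []
    for row in rows:
        location = row.get("Location")
        # rows without a text location are skipped, like na=False in str.contains
        if isinstance(location, str) and needle in location.lower():
            ids.append(str(row["Shipment ID"]).strip())
    return ids


if __name__ == "__main__":
    demo = [
        {"Shipment ID": " 1001 ", "Location": "Site 1, Czech Republic"},
        {"Shipment ID": "1002", "Location": "Depot, Germany"},
    ]
    print(cz_shipment_ids(demo))
```

# Matching on a substring of the location text keeps the filter independent of
# how the site or depot name itself is formatted in the report.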
def run(page, study):
    output_dir = f"xls_shipment_details_{study}"
    os.makedirs(output_dir, exist_ok=True)

    page.goto(f"{BASE_URL}/report/shipment_details_report")
    page.wait_for_load_state("networkidle", timeout=120000)

    # prefer the CZ-only list from the shipments report; otherwise read the dropdown
    cz_ids = get_cz_shipment_ids(study)
    if cz_ids is not None:
        shipments = cz_ids
        print(f" Filtered from the shipments report: {len(shipments)} CZ shipments")
    else:
        page.locator('input[placeholder="search"], input[type="text"]').first.click()
        page.wait_for_timeout(1000)
        shipments = [s.strip() for s in page.locator('mat-option').all_inner_texts()
                     if s.strip() and s.strip() != "No results found"]
        print(f" Found {len(shipments)} shipments in the dropdown")
        page.keyboard.press("Escape")
        page.wait_for_timeout(500)

    if not shipments:
        print(" No shipments, skipping.")
        return

    for shipment in shipments:
        filename = os.path.join(output_dir, f"shipment_details_{shipment}.xlsx")
        if os.path.exists(filename):
            print(f" [{shipment}] Skipping, file already exists.")
            continue
        print(f" [{shipment}] Downloading...")

        input_field = page.locator('input[placeholder="search"], input[type="text"]').first
        input_field.click()
        input_field.fill(shipment)
        page.wait_for_timeout(500)
        page.locator('mat-option').first.dispatch_event('click')
        page.wait_for_load_state("networkidle", timeout=120000)

        with page.expect_download(timeout=120000) as dl:
            page.get_by_role("button", name="Download XLS").click()
        dl.value.save_as(filename)
        print(f" [{shipment}] OK")

        page.get_by_role("button", name="Clear").click()
        page.wait_for_load_state("networkidle", timeout=120000)

    print(" Shipment details done.")

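# Sketch, not part of the original commit: the resume behaviour of run(), where a
# shipment whose file already exists is skipped, separated from the browser steps.
# The helper name pending_downloads and its prefix parameter are assumptions for
# this illustration.

```python
import os
import tempfile


def pending_downloads(names, output_dir, prefix="shipment_details_"):
    """Return (name, path) pairs for files that are not on disk yet."""
    pending = []
    for name in names:
        path = os.path.join(output_dir, f"{prefix}{name}.xlsx")
        if not os.path.exists(path):
            pending.append((name, path))
    return pending


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # pretend shipment 1001 was downloaded in an earlier run
        open(os.path.join(tmp, "shipment_details_1001.xlsx"), "w").close()
        print([name for name, _ in pending_downloads(["1001", "1002"], tmp)])
```

# Because only the existence of the file is checked, a partially written file from
# an interrupted run would also be treated as done and has to be deleted by hand.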
if __name__ == "__main__":
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        # log in and select the study before running the report download
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
@@ -0,0 +1,47 @@
from playwright.sync_api import sync_playwright
import os

# ── CONFIG ──────────────────────────────────────────────────────────────────
BASE_URL = "https://janssen.4gclinical.com"

EMAIL = os.environ.get("IWRS_EMAIL", "")  # assumed variable name; credentials should not be committed
PASSWORD = os.environ.get("IWRS_PASSWORD", "")  # assumed variable name

# STUDY = "42847922MDD3003"
STUDY = "77242113UCO3001"

OUTPUT_DIR = f"xls_shipments_{STUDY}"
# ────────────────────────────────────────────────────────────────────────────

def run(page, study):
    output_dir = f"xls_shipments_{study}"
    os.makedirs(output_dir, exist_ok=True)

    page.goto(f"{BASE_URL}/report/shipments_report")
    page.wait_for_load_state("networkidle", timeout=120000)

    filename = os.path.join(output_dir, f"shipments_report_{study}.xlsx")
    with page.expect_download(timeout=120000) as dl:
        page.get_by_role("button", name="Download XLS").click()
    dl.value.save_as(filename)
    print(f" Shipments report OK -> {filename}")


if __name__ == "__main__":
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()
        page.goto(BASE_URL)
        page.wait_for_load_state("networkidle")
        page.get_by_label("Email *").fill(EMAIL)
        page.get_by_label("Password *").fill(PASSWORD)
        page.locator('#login__submit').click()
        page.wait_for_load_state("networkidle")
        page.get_by_label("Study *").click()
        page.get_by_role("option", name=STUDY).click()
        page.get_by_role("button", name="SELECT").click()
        page.wait_for_load_state("networkidle")
        run(page, STUDY)
        browser.close()
||||
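The login values used in the `__main__` block could be read from the environment at start-up. A minimal sketch, assuming the hypothetical variable names `IWRS_EMAIL` and `IWRS_PASSWORD` (neither is defined anywhere in this commit) and a hypothetical helper name `load_credentials`, that fails early when a value is missing:

```python
import os


def load_credentials(env=None):
    """Return (email, password); the variable names are assumptions."""
    env = os.environ if env is None else env
    # Collect every missing name so the error message lists all of them at once.
    missing = [k for k in ("IWRS_EMAIL", "IWRS_PASSWORD") if not env.get(k)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["IWRS_EMAIL"], env["IWRS_PASSWORD"]
```

Passing a plain dict as `env` makes the helper easy to exercise without touching the real environment.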
@@ -0,0 +1,441 @@
"""
Imports drug data from the IWRS Excel reports into MySQL.

Tables:
  iwrs_shipments: shipments (CZ only, versioned by import_id)
  iwrs_shipment_items: shipment contents (versioned by import_id)
  iwrs_inventory: drug inventory at the sites (versioned by import_id)
  iwrs_destruction: destruction (not versioned, skips baskets that were already imported)

Run after the files have been downloaded (or via run_all.py).
"""

import os
import glob
import re
import datetime

import numpy as np
import pandas as pd
import mysql.connector

import db_config

BASE_DIR = os.path.dirname(os.path.abspath(__file__))
STUDIES = ["77242113UCO3001", "42847922MDD3003"]

SITES = {
    "77242113UCO3001": [
        "DD5-CZ10001", "DD5-CZ10003", "DD5-CZ10006", "DD5-CZ10009",
        "DD5-CZ10010", "DD5-CZ10012", "DD5-CZ10013", "DD5-CZ10015",
        "DD5-CZ10016", "DD5-CZ10020", "DD5-CZ10021", "DD5-CZ10022",
    ],
    "42847922MDD3003": [
        "S10-CZ10002", "S10-CZ10004", "S10-CZ10005",
        "S10-CZ10008", "S10-CZ10011", "S10-CZ10012",
    ],
}


# ── type converters ──────────────────────────────────────────────────────────

def _py(val):
    if isinstance(val, np.generic):
        return val.item()
    return val

def to_date(val):
    val = _py(val)
    if val is None:
        return None
    if isinstance(val, float) and (val != val):
        return None
    try:
        if pd.isna(val):
            return None
    except (TypeError, ValueError):
        pass
    if isinstance(val, pd.Timestamp):
        return None if pd.isna(val) else val.date()
    if isinstance(val, datetime.datetime):
        return val.date()
    if isinstance(val, datetime.date):
        return val
    s = str(val).strip()
    if not s or s.lower() in ("nat", "nan", "none", ""):
        return None
    for fmt in ("%Y-%m-%d", "%d-%b-%Y", "%d-%m-%Y", "%Y-%m-%d %H:%M:%S"):
        try:
            return datetime.datetime.strptime(s, fmt).date()
        except ValueError:
            pass
    return None

def to_int(val):
    val = _py(val)
    try:
        v = float(val)
        return None if (v != v) else int(v)
    except (TypeError, ValueError):
        return None

def to_str(val):
    val = _py(val)
    if val is None:
        return None
    if isinstance(val, float) and (val != val):
        return None
    s = str(val).strip()
    return None if s.lower() in ("nan", "nat", "none", "") else s


# ── DB helpers ───────────────────────────────────────────────────────────────

def get_conn():
    return mysql.connector.connect(
        host=db_config.DB_HOST, port=db_config.DB_PORT,
        user=db_config.DB_USER, password=db_config.DB_PASSWORD,
        database=db_config.DB_NAME,
    )

def insert_import(cursor, study, source_label):
    cursor.execute(
        "INSERT INTO iwrs_import (study, imported_at, source_file, report_type) VALUES (%s, %s, %s, %s)",
        (study, datetime.datetime.now(), source_label, "drugs"),
    )
    return cursor.lastrowid

def basket_already_imported(cursor, study, basket_id):
    cursor.execute(
        "SELECT 1 FROM iwrs_destruction WHERE study=%s AND basket_id=%s LIMIT 1",
        (study, str(basket_id)),
    )
    return cursor.fetchone() is not None


# ── parsers ──────────────────────────────────────────────────────────────────

def parse_shipments_report(study):
    path = os.path.join(BASE_DIR, f"xls_shipments_{study}", f"shipments_report_{study}.xlsx")
    if not os.path.exists(path):
        print(f" MISSING: {path}")
        return []

    raw = pd.read_excel(path, header=None)
    header_row = None
    for i, row in raw.iterrows():
        if "Shipment ID" in [str(v).strip() for v in row]:
            header_row = i
            break
    if header_row is None:
        return []

    df = pd.read_excel(path, header=header_row)
    df = df.dropna(how="all")
    # CZ shipments only
    df = df[df["Location"].astype(str).str.contains("Czech", na=False, case=False)]
    col = df.columns.tolist()

    rows = []
    for _, r in df.iterrows():
        rows.append({
            "shipment_id": to_str(r["Shipment ID"]),
            "status": to_str(r["IRT Shipment Status"]),
            "type": to_str(r["Type"]),
            "ship_from": to_str(r["Shipment From"]),
            "ship_to_site": to_str(r["Ship To:"]),
            "location": to_str(r["Location"]),
            "request_date": to_date(r["Request Date"]),
            "shipped_date": to_date(r["Shipped Date"]),
            "received_date": to_date(r["Received Date"]) if "Received Date" in col else None,
            "received_by": to_str(r["Received by"]) if "Received by" in col else None,
            "delivered_date_utc": to_date(r["Delivered Date [UTC]"]) if "Delivered Date [UTC]" in col else None,
            "delivery_recipient": to_str(r["Delivery Recipient"]) if "Delivery Recipient" in col else None,
            "delivery_details": to_str(r["Delivery Details"]) if "Delivery Details" in col else None,
            "cancelled_date": to_date(r["Cancelled Date"]) if "Cancelled Date" in col else None,
            "total_medication_ids": to_int(r["Total Medication IDs"]) if "Total Medication IDs" in col else None,
            "tracking_no": to_str(r["Tracking #"]) if "Tracking #" in col else None,
            "shipping_category": to_str(r["Shipping Category"]) if "Shipping Category" in col else None,
            "expected_arrival": to_date(r["Expected Arrival"]) if "Expected Arrival" in col else None,
        })
    return rows


def parse_shipment_details(study):
    detail_dir = os.path.join(BASE_DIR, f"xls_shipment_details_{study}")
    files = sorted(glob.glob(os.path.join(detail_dir, "shipment_details_*.xlsx")))
    rows = []
    for path in files:
        # shipment ID taken from the file name
        m = re.search(r"shipment_details_(.+)\.xlsx", os.path.basename(path))
        shipment_id = m.group(1) if m else "UNKNOWN"

        raw = pd.read_excel(path, header=None)
        header_row = None
        for i, row in raw.iterrows():
            if "Medication ID" in [str(v).strip() for v in row]:
                header_row = i
                break
        if header_row is None:
            continue

        df = pd.read_excel(path, header=header_row)
        df = df.dropna(how="all")
        col = df.columns.tolist()

        for _, r in df.iterrows():
            # normalize column names that differ between studies
            med_desc = (to_str(r.get("Medication Description"))
                        or to_str(r.get("Medication ID Description")))
            med_type = (to_str(r.get("Medication type"))
                        or to_str(r.get("Medication ID type")))
            rows.append({
                "shipment_id": shipment_id,
                "destination_location": to_str(r.get("Destination Location")),
                "shipment_status": to_str(r.get("IRT Shipment Status")),
                "shipment_type": to_str(r.get("Type")),
                "destination_site": to_str(r.get("Destination Site")),
                "investigator": to_str(r.get("Investigator")),
                "medication_description": med_desc,
                "medication_type": med_type,
                "medication_id": to_str(r.get("Medication ID")),
                "packaged_lot_no": to_str(r.get("Packaged Lot number")),
                "packaged_lot_description": to_str(r.get("Packaged Lot description")),
                "container_id": to_str(r.get("Container ID")),
                "quantity": to_int(r.get("Quantity of Medication IDs")),
                "expiration_date": to_date(r.get("Expiration Date")),
                "item_status": to_str(r.get("Status")),
            })
    return rows


def parse_inventory(study):
    inv_dir = os.path.join(BASE_DIR, f"xls_reports_{study}")
    files = sorted(glob.glob(os.path.join(inv_dir, "onsite_inventory_detail_*.xlsx")))
    rows = []
    for path in files:
        raw = pd.read_excel(path, header=None)

        # extract metadata from the header block
        site = investigator = location = None
        header_row = None
        for i, row in raw.iterrows():
            first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
            if first.startswith("Site:"):
                site = first.replace("Site:", "").strip()
            elif first.startswith("Investigator:"):
                investigator = first.replace("Investigator:", "").strip()
            elif first.startswith("Location:"):
                location = first.replace("Location:", "").strip()
            # data header row: the first column is "Medication" or "Medication ID"
            if first in ("Medication", "Medication ID") and header_row is None:
                header_row = i
        if header_row is None:
            continue

        df = pd.read_excel(path, header=header_row)
        df = df.dropna(how="all")
        # normalize the first column to "medication_id"
        df = df.rename(columns={df.columns[0]: "medication_id"})
        col = df.columns.tolist()

        for _, r in df.iterrows():
            rows.append({
                "site": site,
                "investigator": investigator,
                "location": location,
                "medication_id": to_str(r["medication_id"]),
                "packaged_lot_no": to_str(r.get("Packaged Lot number")),
                "original_expiration_date": to_date(r.get("Original Expiration Date when Packaged Lot was Added")),
                "expiration_date": to_date(r.get("Expiration date")),
                "received_date": to_date(r.get("Received Date")),
                "receipt_user": to_str(r.get("Shipment Receipt User")),
                "subject_identifier": to_str(r.get("Subject Identifier")),
                "quantity_assigned": to_int(r.get("Quantity Assigned")),
                "irt_transaction": to_str(r.get("IRT Transaction")),
                "date_assigned": to_date(r.get("Date Assigned")),
                "assignment_user": to_str(r.get("Assignment User")),
                "dispensation_status": to_str(r.get("Dispensation Status")),
                "dispensing_date": to_date(r.get("Dispensing date") or r.get("Dispensing Date")),
                "quantity_dispensed": to_int(r.get("Quantity Dispensed")),
                "dispensing_user": to_str(r.get("Dispensing User")),
                "quantity_returned": to_int(r.get("Quantity Returned")),
                "date_returned": to_date(r.get("Date Returned")),
                "return_user": to_str(r.get("Return User")),
            })
    return rows


def parse_destruction_files(study):
    dest_dir = os.path.join(BASE_DIR, f"xls_ip_destruction_{study}")
    files = sorted(glob.glob(os.path.join(dest_dir, "ip_destruction_basket_*.xlsx")))
    baskets = []
    for path in files:
        raw = pd.read_excel(path, header=None)

        # metadata from the header block
        meta = {}
        header_row = None
        for i, row in raw.iterrows():
            first = str(row.iloc[0]).strip() if pd.notna(row.iloc[0]) else ""
            for key, attr in [
                ("Investigator Name:", "investigator"),
                ("Site ID:", "site_id"),
                ("Location:", "location"),
                ("Basket ID:", "basket_id"),
                ("Drug Destruction Created Date:", "destruction_date"),
            ]:
                if first.startswith(key):
                    meta[attr] = first.replace(key, "").strip()
            if first == "Medication ID Description" and header_row is None:
                header_row = i

        if header_row is None:
            continue

        df = pd.read_excel(path, header=header_row)
        df = df.dropna(how="all")

        items = []
        for _, r in df.iterrows():
            items.append({
                "medication_description": to_str(r.get("Medication ID Description")),
                "medication_id": to_str(r.get("Medication ID")),
                "packaged_lot_description": to_str(r.get("Packaged Lot description")),
                "comments": to_str(r.get("Comments")),
            })

        baskets.append({
            "site_id": meta.get("site_id"),
            "investigator": meta.get("investigator"),
            "location": meta.get("location"),
            "basket_id": meta.get("basket_id"),
            "destruction_date": to_date(meta.get("destruction_date")),
            "items": items,
        })
    return baskets


# ── inserters ────────────────────────────────────────────────────────────────

def insert_shipments(cursor, import_id, study, rows):
    sql = """INSERT INTO iwrs_shipments
        (import_id, study, shipment_id, status, type, ship_from, ship_to_site,
         location, request_date, shipped_date, received_date, received_by,
         delivered_date_utc, delivery_recipient, delivery_details, cancelled_date,
         total_medication_ids, tracking_no, shipping_category, expected_arrival)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    for r in rows:
        cursor.execute(sql, (
            import_id, study, r["shipment_id"], r["status"], r["type"],
            r["ship_from"], r["ship_to_site"], r["location"],
            r["request_date"], r["shipped_date"], r["received_date"],
            r["received_by"], r["delivered_date_utc"], r["delivery_recipient"],
            r["delivery_details"], r["cancelled_date"], r["total_medication_ids"],
            r["tracking_no"], r["shipping_category"], r["expected_arrival"],
        ))


def insert_shipment_items(cursor, import_id, study, rows):
    sql = """INSERT INTO iwrs_shipment_items
        (import_id, study, shipment_id, destination_location, shipment_status,
         shipment_type, destination_site, investigator, medication_description,
         medication_type, medication_id, packaged_lot_no, packaged_lot_description,
         container_id, quantity, expiration_date, item_status)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    for r in rows:
        cursor.execute(sql, (
            import_id, study, r["shipment_id"], r["destination_location"],
            r["shipment_status"], r["shipment_type"], r["destination_site"],
            r["investigator"], r["medication_description"], r["medication_type"],
            r["medication_id"], r["packaged_lot_no"], r["packaged_lot_description"],
            r["container_id"], r["quantity"], r["expiration_date"], r["item_status"],
        ))


def insert_inventory(cursor, import_id, study, rows):
    sql = """INSERT INTO iwrs_inventory
        (import_id, study, site, investigator, location, medication_id,
         packaged_lot_no, original_expiration_date, expiration_date, received_date,
         receipt_user, subject_identifier, quantity_assigned, irt_transaction,
         date_assigned, assignment_user, dispensation_status, dispensing_date,
         quantity_dispensed, dispensing_user, quantity_returned, date_returned, return_user)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    for r in rows:
        cursor.execute(sql, (
            import_id, study, r["site"], r["investigator"], r["location"],
            r["medication_id"], r["packaged_lot_no"], r["original_expiration_date"],
            r["expiration_date"], r["received_date"], r["receipt_user"],
            r["subject_identifier"], r["quantity_assigned"], r["irt_transaction"],
            r["date_assigned"], r["assignment_user"], r["dispensation_status"],
            r["dispensing_date"], r["quantity_dispensed"], r["dispensing_user"],
            r["quantity_returned"], r["date_returned"], r["return_user"],
        ))


def insert_destruction(cursor, study, baskets):
    sql = """INSERT IGNORE INTO iwrs_destruction
        (study, site_id, investigator, location, basket_id, destruction_date,
         medication_description, medication_id, packaged_lot_description, comments)
        VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
    skipped = 0
    imported = 0
    for b in baskets:
        if basket_already_imported(cursor, study, b["basket_id"]):
            skipped += 1
            continue
        for item in b["items"]:
            cursor.execute(sql, (
                study, b["site_id"], b["investigator"], b["location"],
                b["basket_id"], b["destruction_date"],
                item["medication_description"], item["medication_id"],
                item["packaged_lot_description"], item["comments"],
            ))
            imported += 1
    return imported, skipped


# ── main ─────────────────────────────────────────────────────────────────────

def import_study(study):
    print(f"\n Parsing data for {study}...")
    shipments = parse_shipments_report(study)
    items = parse_shipment_details(study)
    inventory = parse_inventory(study)
    baskets = parse_destruction_files(study)

    print(f" Shipments: {len(shipments)} | Shipment items: {len(items)} | Inventory: {len(inventory)} | Destruction baskets: {len(baskets)}")

    conn = get_conn()
    cursor = conn.cursor()

    import_id = insert_import(cursor, study, f"drugs_{study}")
    print(f" import_id = {import_id}")

    insert_shipments(cursor, import_id, study, shipments)
    insert_shipment_items(cursor, import_id, study, items)
    insert_inventory(cursor, import_id, study, inventory)
    dest_imported, dest_skipped = insert_destruction(cursor, study, baskets)

    conn.commit()
    cursor.close()
    conn.close()
    print(f" Destruction: {dest_imported} new | {dest_skipped} baskets skipped (already imported)")


def main():
    for study in STUDIES:
        print(f"\n{'='*60}")
        print(f"[{study}]")
        print(f"{'='*60}")
        try:
            import_study(study)
            print(" OK")
        except Exception as e:
            import traceback
            print(f" ERROR: {e}")
            traceback.print_exc()
    print("\nDone.")


main()
||||
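Each parser in the import script first reads the sheet without a header, scans for a known column label, and then re-reads the sheet from that row. A minimal sketch of the scan as one reusable function over plain lists of cell values; `find_header_row` is a hypothetical name and the sample sheet is made-up placeholder data. Unlike the inventory and destruction parsers, which look only at the first column, this version matches the label in any cell.

```python
def find_header_row(rows, labels):
    """Return the index of the first row containing any of `labels`, else None."""
    wanted = set(labels)
    for i, row in enumerate(rows):
        # Compare on stripped string forms, as the parsers do.
        cells = {str(v).strip() for v in row}
        if cells & wanted:
            return i
    return None


# Placeholder sheet: two preamble rows, then the header row, then data.
sheet = [
    ["Report preamble", None],
    [None, None],
    ["Shipment ID", "Location"],
    ["placeholder-id", "placeholder-location"],
]
print(find_header_row(sheet, ["Shipment ID"]))  # 2
```

With pandas, the rows could come from `raw.values.tolist()` after `pd.read_excel(path, header=None)`.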
Binary file not shown.
@@ -0,0 +1,88 @@
import sys
import os
import datetime
from pathlib import Path
from playwright.sync_api import sync_playwright

import download_reports
import download_ip_destruction
import download_shipments_report
import download_shipment_details
import create_accountability_report

BASE_URL = "https://janssen.4gclinical.com"
EMAIL = os.environ.get("IWRS_EMAIL", "")  # assumed variable name; credentials should not be committed
PASSWORD = os.environ.get("IWRS_PASSWORD", "")  # assumed variable name

STUDIES = {
    "1": "77242113UCO3001",
    "2": "42847922MDD3003",
}


def pick_study():
    print("Select a study:")
    for k, v in STUDIES.items():
        print(f" {k}) {v}")
    while True:
        choice = input("Choice (1/2): ").strip()
        if choice in STUDIES:
            return STUDIES[choice]
        print(" Invalid choice, try again.")


def login_and_select_study(page, study):
    print(f"\n[1/6] Logging in and selecting study {study}...")
    page.goto(BASE_URL)
    page.wait_for_load_state("networkidle")
    page.get_by_label("Email *").fill(EMAIL)
    page.get_by_label("Password *").fill(PASSWORD)
    page.locator('#login__submit').click()
    page.wait_for_load_state("networkidle")
    page.get_by_label("Study *").click()
    page.get_by_role("option", name=study).click()
    page.get_by_role("button", name="SELECT").click()
    page.wait_for_load_state("networkidle")
    print(" OK")


def main():
    os.chdir(os.path.dirname(os.path.abspath(__file__)))

    study = pick_study()

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context(accept_downloads=True)
        page = context.new_page()

        login_and_select_study(page, study)

        print("\n[2/6] Downloading inventory reports...")
        download_reports.run(page, study)

        print("\n[3/6] Downloading IP destruction reports...")
        download_ip_destruction.run(page, study)

        print("\n[4/6] Downloading shipments report...")
        download_shipments_report.run(page, study)

        print("\n[5/6] Downloading shipment details...")
        download_shipment_details.run(page, study)

        browser.close()

    print("\n[6/6] Generating accountability report...")
    create_accountability_report.STUDY = study
    create_accountability_report.INVENTORY_DIR = Path(f"xls_reports_{study}")
    create_accountability_report.DESTRUCTION_DIR = Path(f"xls_ip_destruction_{study}")
    create_accountability_report.SHIPMENTS_FILE = Path(f"xls_shipments_{study}/shipments_report_{study}.xlsx")
    create_accountability_report.DETAILS_DIR = Path(f"xls_shipment_details_{study}")
    create_accountability_report.OUTPUT_FILE = create_accountability_report.OUTPUT_DIR / f"{datetime.date.today().strftime('%Y-%m-%d')} {study} CZ IWRS overview.xlsx"
    create_accountability_report.main()

    print("\nAll done!")


if __name__ == "__main__":
    main()
||||
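The per-study paths that `main()` sets on `create_accountability_report` before calling its `main()` can be gathered in one function, which makes the naming scheme easier to check in isolation. A minimal sketch, assuming a hypothetical helper name `report_paths`; the directory and file name patterns are the ones used in the script above, and `output_dir` stands in for `create_accountability_report.OUTPUT_DIR`.

```python
import datetime
from pathlib import Path


def report_paths(study, output_dir, today=None):
    """Build the per-study input and output paths for the accountability report."""
    today = today or datetime.date.today()
    return {
        "INVENTORY_DIR": Path(f"xls_reports_{study}"),
        "DESTRUCTION_DIR": Path(f"xls_ip_destruction_{study}"),
        "SHIPMENTS_FILE": Path(f"xls_shipments_{study}") / f"shipments_report_{study}.xlsx",
        "DETAILS_DIR": Path(f"xls_shipment_details_{study}"),
        # date objects accept strftime codes in format specs
        "OUTPUT_FILE": Path(output_dir) / f"{today:%Y-%m-%d} {study} CZ IWRS overview.xlsx",
    }
```

The result could then be applied with `setattr(create_accountability_report, name, value)` for each item, keeping the module-attribute override approach of the original.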
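@@ -0,0 +1,8 @@
import os

# Settings are read from the environment (variable names are assumptions) so that no secret is committed.
DB_HOST = os.environ.get("IWRS_DB_HOST", "192.168.1.76")
DB_PORT = int(os.environ.get("IWRS_DB_PORT", "3306"))
DB_USER = os.environ.get("IWRS_DB_USER", "root")
DB_PASSWORD = os.environ.get("IWRS_DB_PASSWORD", "")
DB_NAME = os.environ.get("IWRS_DB_NAME", "studie")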
||||
Binary file not shown.
@@ -0,0 +1,52 @@
import mysql.connector
import pandas as pd
import db_config

conn = mysql.connector.connect(
    host=db_config.DB_HOST, port=db_config.DB_PORT,
    user=db_config.DB_USER, password=db_config.DB_PASSWORD,
    database=db_config.DB_NAME,
)
cursor = conn.cursor(dictionary=True)

# Take the latest import_id for each study
for study in ["77242113UCO3001", "42847922MDD3003"]:
    cursor.execute(
        "SELECT MAX(import_id) AS mid FROM iwrs_import WHERE study=%s AND report_type='patients'",
        (study,),
    )
    row = cursor.fetchone()
    mid = row["mid"]
    print(f"\n=== {study} (import_id={mid}) ===")

    cursor.execute("""
        SELECT
            v.subject,
            v.actual_date,
            v.scheduled_date,
            v.irt_transaction_no,
            v.irt_transaction_description,
            v.medication_assignment,
            GROUP_CONCAT(v.medication_id ORDER BY v.medication_id SEPARATOR ', ') AS medication_ids,
            SUM(v.quantity_assigned) AS quantity_assigned
        FROM iwrs_subject_visits v
        WHERE v.import_id = %s AND v.study = %s AND v.visit_type = 'Past'
          AND v.irt_transaction_no IS NOT NULL
        GROUP BY v.subject, v.actual_date, v.scheduled_date, v.irt_transaction_no,
                 v.irt_transaction_description, v.medication_assignment
        ORDER BY v.subject, v.actual_date
        LIMIT 20
    """, (mid, study))

    rows = cursor.fetchall()
    df = pd.DataFrame(rows)
    if df.empty:
        print(" No data.")
    else:
        pd.set_option("display.max_columns", None)
        pd.set_option("display.width", 200)
        pd.set_option("display.max_colwidth", 30)
        print(df.to_string(index=False))

cursor.close()
conn.close()
||||
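The check script reads only the newest snapshot: it takes `MAX(import_id)` for a study and report type, then filters the data table by that id. A minimal sketch of the lookup, using the standard library's `sqlite3` as a stand-in for MySQL (so placeholders are `?` rather than `%s`); the table is reduced to the three columns the lookup needs, and the study names and rows are made-up placeholders.

```python
import sqlite3


def latest_import_id(conn, study, report_type):
    """Return the highest import_id for a study and report type, or None."""
    row = conn.execute(
        "SELECT MAX(import_id) FROM iwrs_import WHERE study = ? AND report_type = ?",
        (study, report_type),
    ).fetchone()
    # MAX over zero rows yields NULL, which sqlite3 returns as None.
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iwrs_import (import_id INTEGER PRIMARY KEY, study TEXT, report_type TEXT)"
)
# import_id values 1, 2, 3 are assigned automatically in insertion order.
conn.executemany(
    "INSERT INTO iwrs_import (study, report_type) VALUES (?, ?)",
    [("STUDY_A", "patients"), ("STUDY_A", "drugs"), ("STUDY_A", "patients")],
)
print(latest_import_id(conn, "STUDY_A", "patients"))  # 3
print(latest_import_id(conn, "STUDY_B", "patients"))  # None
```

A `None` result means no import exists yet for that study, in which case the follow-up query matches no rows.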
Binary file not shown.
Some files were not shown because too many files have changed in this diff.