Project: general / bcom-tp-etl-transformation-pipelines

Commit 0e639513
Authored Aug 02, 2023 by Cristian Aguirre
Update 01-08-23. Fix some bugs (377, 378)
Parent: 03b3285a

Changes: 15 changed files with 113 additions and 71 deletions (+113, -71)
README.md                                      +29  -1
dags/components/Extractor.py                    +1  -0
dags/components/Generation.py                   +1  -0
dags/components/S3Route.py                      +3  -1
dags/components/Utils.py                        +1  -1
dags/dag_conf.yml                               +7  -7
dags/dag_inform_process.py                     +18 -17
deploy-k8/airflow-envvars-configmap.yaml       +21 -16
deploy-k8/airflow-scheduler-deployment.yaml     +4  -0
deploy-k8/airflow-volumes.yaml                 +12 -12
deploy-k8/airflow-webserver-deployment.yaml     +4  -0
deploy-k8/postgres-deployment.yaml              +2  -2
deploy-k8/script-apply.sh                       +1  -1
deploy-k8/script-delete.sh                      +1  -1
deploy-k8/sync-dags-deployment-gcs.yaml         +8 -12
README.md

@@ -456,8 +456,36 @@ Listo. Con esto se ha configurado todos los servicios requeridos para levantar n

### Deploying Airflow with GKE

⚠️ Note: before deploying our components, we must make sure they fit within the cluster's
resources (in terms of vCPU and RAM). To do this, validate the deployment-type templates,
i.e. the ones whose names end with the word _"deployment"_: "airflow-scheduler-deployment.yaml",
"airflow-webserver-deployment.yaml", "postgres-deployment.yaml" and "sync-dags-deployment-gcs"
(in this case, since we are on GCP), plus the file "airflow-envvars-configmap.yaml", which holds
the resources for the workers. All of the templates mentioned have the parameters
"containers.resources.limits" or "containers.resources.requests", which set the maximum and the
minimum, respectively (a resource-check sketch follows this excerpt).

1. Once the Bucket and the NFS share (Filestore) are configured, we can deploy our Airflow
   system on Kubernetes. To do so, move into the **"deploy-k8"** folder and run the following
   command:

   ```shell
   sh script-apply.sh
   ```

   This starts all the components; within a few seconds or minutes everything will be up and
   ready to use.

2. To check that all the components are up, run the following command:

   ```shell
   kubectl get pods -n bcom-airflow
   ```

   Something similar to this should be displayed:

   [screenshot: output of `kubectl get pods`]
dags/components/Extractor.py

```diff
@@ -4,6 +4,7 @@ import json
 import numpy as np
 import pandas as pd
 from enums.ProcessStatusEnum import ProcessStatusEnum
+from enums.DatabaseTypeEnum import DatabaseTypeEnum
 from components.Utils import select_multiple, generateModel
 from components.DatabaseOperation.DatabaseExtraction import get_iterator, get_steps
 from components.DatabaseOperation.DatabaseLoad import save_from_dataframe
```
dags/components/Generation.py

```diff
@@ -8,6 +8,7 @@ from airflow.decorators import task
 from airflow.exceptions import AirflowSkipException
 from enums.ProcessStatusEnum import ProcessStatusEnum
+from enums.DatabaseTypeEnum import DatabaseTypeEnum
 from components.S3Route import save_df_to_s3, load_control_to_s3
 from components.Utils import select_multiple, create_temp_file, delete_temp_dir
 from components.Control import get_tasks_from_control, update_new_process
```
dags/components/S3Route.py

```diff
@@ -4,6 +4,7 @@ from typing import Any, Dict, List, Tuple
 import pytz
 from io import BytesIO, StringIO
 import pandas as pd
+import re
 from components.Utils import get_type_file
 from enums.FileTypeEnum import FileTypeEnum
@@ -181,6 +182,7 @@ def get_file_from_prefix(conn: str, bucket: str, key: str, provider: str, timezo
                          frequency: str = "montly") -> Any:
     result, key_result = BytesIO(), ''
     try:
+        format_re_pattern = '[0-9]{4}-[0-9]{2}\.'
         format_date = "%Y-%m" if frequency == "montly" else "%Y-%W"
         period = str(datetime_by_tzone(timezone, format_date))[:7]
         logger.info(f"Periodo actual: {period}.")
@@ -197,7 +199,7 @@ def get_file_from_prefix(conn: str, bucket: str, key: str, provider: str, timezo
         files = gcp_hook.list(bucket, prefix=key)
         files_with_period = []
         for file in files:
-            if file.endswith("/"):
+            if file.endswith("/") or not re.search(format_re_pattern, file):
                 continue
             file_period = file[file.rfind("_") + 1: file.rfind(".")]
             files_with_period.append((file, file_period))
```
dags/components/Utils.py

```diff
@@ -107,7 +107,7 @@ def update_sql_commands(dataset: List[Tuple[str, str]], label_tablename: str) ->
                 final_data[-1] = final_data[-1] + "; end;"
             final_item = item
             if item.lower().strip().find(label_tablename.lower().strip() + ":") != -1:
-                init_index = item.lower().strip().index(label_tablename.lower().strip() + ":")
+                init_index = item.replace(" ", "").lower().strip().index(label_tablename.lower().strip() + ":")
                 table_name = item.replace(" ", "").strip()[init_index + len(label_tablename + ":"):].strip()
                 add_next = True
         elif item != "":
```
dags/dag_conf.yml

```diff
@@ -10,7 +10,7 @@ app:
     port: 3306
     username: admin
     password: adminadmin
-    database: prueba_bcom
+    database: prueba_ca_1
     service: ORCLPDB1
     schema: sources
   transformation:
@@ -19,7 +19,7 @@ app:
     port: 3306
     username: admin
     password: adminadmin
-    database: prueba_bcom2
+    database: prueba_ca_2
     service:
     schema: intern_db
     chunksize: 8000
@@ -28,16 +28,16 @@ app:
   procedure_mask: procedure   # S
   transformation_mask: transform   # S
   prefix_order_delimiter: .
-  cloud_provider: google
+  cloud_provider: aws
   scripts:
     s3_params:
-      bucket: prueba-airflow3
+      bucket: prueba-airflow13
       prefix: bcom_scripts
       connection_id: conn_script
   control:
     s3_params:
       connection_id: conn_script
-      bucket: prueba-airflow3
+      bucket: prueba1234568
       prefix: bcom_control
       filename: control_<period>.json
       timezone: 'GMT-5'
@@ -48,12 +48,12 @@ app:
     delimiter: '|'
     tmp_path: /tmp
     s3_params:
-      bucket: prueba-airflow3
+      bucket: prueba-airflow13
       prefix: bcom_results
       connection_id: conn_script
   report:
     s3_params:
-      bucket: prueba-airflow3
+      bucket: prueba1234568
       prefix: bcom_report
       connection_id: conn_script
       filename: report_<datetime>.xlsx
```
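A hedged sanity check for the renamed buckets in dag_conf.yml, assuming the AWS CLI is configured with the same credentials the conn_script connection uses (bucket and prefix names taken from the diff above; not part of the commit):

```shell
# Confirm the buckets and prefixes referenced by dag_conf.yml are reachable.
aws s3 ls s3://prueba-airflow13/bcom_scripts/
aws s3 ls s3://prueba-airflow13/bcom_results/
aws s3 ls s3://prueba1234568/bcom_control/
aws s3 ls s3://prueba1234568/bcom_report/
```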
dags/dag_inform_process.py

```diff
@@ -76,7 +76,7 @@ def create_report(tmp_path: str, **kwargs) -> None:
     title_format.set_font_size(20)
     title_format.set_font_color("#333333")
-    header = f"Reporte ejecutado el día {execution_date}"
+    header = f"Proceso ejecutado el día {execution_date}"
     if status == ProcessStatusEnum.SUCCESS.value:
         status = "EXITOSO"
     elif status == ProcessStatusEnum.FAIL.value:
@@ -95,7 +95,7 @@ def create_report(tmp_path: str, **kwargs) -> None:
     row_format = workbook.add_format()
     row_format.set_font_size(8)
     row_format.set_font_color("#000000")
-    if status != ProcessStatusEnum.RESET.value:
+    base_index = 5
     for index, key in enumerate(data.keys()):
         index = base_index + index
@@ -125,7 +125,8 @@ def get_data_report(**kwargs) -> None:
     else:
         last_process = control[-1]
         if "reset_by_user" in last_process.keys():
-            report_data["PROCESS_EXECUTION"] = ProcessStatusEnum.RESET.value
+            report_data["PROCESS_STATUS"] = ProcessStatusEnum.RESET.value
+            report_data["PROCESS_EXECUTION"] = last_process["date"]
         else:
             total_tasks = [last_process["tasks"]]
             current_status = last_process["status"]
```
deploy-k8/airflow-envvars-configmap.yaml

```diff
 apiVersion: v1
 kind: Namespace
 metadata:
   name: bcom-airflow
 ---
 apiVersion: v1
 kind: ServiceAccount
 metadata:
   name: bcom-airflow
   namespace: bcom-airflow
 ---
 # apiVersion: v1
 # kind: Namespace
 # metadata:
 #   name: bcom-airflow
 #
 # ---
 # apiVersion: v1
 # kind: ServiceAccount
 # metadata:
 #   name: bcom-airflow
 #   namespace: bcom-airflow
 #
 # ---
 apiVersion: v1
 kind: ConfigMap
@@ -36,6 +36,10 @@ data:
             value: LocalExecutor
         image: dumy-image
         imagePullPolicy: IfNotPresent
+        resources:
+          limits:
+            cpu: "1000m"
+            memory: "2Gi"
         name: base
         volumeMounts:
           - name: dags-host-volume
@@ -77,19 +81,20 @@ data:
   AIRFLOW__CORE__DEFAULT_TIMEZONE: America/Lima
   AIRFLOW__KUBERNETES__KUBE_CLIENT_REQUEST_ARGS: '{"_request_timeout": [60,60]}'
   AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: cristianfernando/airflow_custom
-  AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "0.0.5"
+  AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: "0.0.6"
   AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM: airflow-logs-pvc
   AIRFLOW__KUBERNETES__ENV_FROM_CONFIGMAP_REF: airflow-envvars-configmap
   AIRFLOW__KUBERNETES_EXECUTOR__POD_TEMPLATE_FILE: /opt/airflow/templates/pod_template.yaml
-  AIRFLOW__CORE__EXECUTOR: KubernetesExecutor
+  AIRFLOW__CORE__EXECUTOR: LocalExecutor
   AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
   AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
-  AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
+  AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
   _AIRFLOW_DB_UPGRADE: 'true'
   _AIRFLOW_WWW_USER_CREATE: 'true'
   _AIRFLOW_WWW_USER_USERNAME: admin
   _AIRFLOW_WWW_USER_PASSWORD: admin
   S3_DAGS_DIR: 's3://prueba1234568/dags'
   GCS_DAGS_DIR: 'gs://prueba-rsync2/carpeta'
   SYNCHRONYZE_DAG_DIR: '30'
   MINIO_SERVER: 'http://192.168.49.2:9000'
   MINIO_DAGS_DIR: '/prueba-ca/dags'
\ No newline at end of file
```
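Pods only pick up ConfigMap-driven environment variables when they restart, so after applying the updated airflow-envvars-configmap.yaml the webserver and scheduler need a rollout. A minimal sketch, assuming the Deployment names match the template file names (adjust if they differ):

```shell
# Check that the new values landed in the ConfigMap.
kubectl -n bcom-airflow get configmap airflow-envvars-configmap -o yaml \
  | grep -E "EXECUTOR|LOAD_EXAMPLES|WORKER_CONTAINER_TAG"

# Restart the consumers so they load the updated environment (names assumed).
kubectl -n bcom-airflow rollout restart deployment airflow-webserver airflow-scheduler
```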
deploy-k8/airflow-scheduler-deployment.yaml

```diff
@@ -22,6 +22,10 @@ spec:
       containers:
         - name: airflow-scheduler
           image: cristianfernando/airflow_custom:0.0.4
+          resources:
+            requests:
+              cpu: "1000m"
+              memory: "4Gi"
           args: ["scheduler"]
           envFrom:
             - configMapRef:
```
deploy-k8/airflow-volumes.yaml

```diff
@@ -5,13 +5,13 @@ metadata:
   namespace: bcom-airflow
 spec:
   capacity:
-    storage: 300Mi
+    storage: 5Gi
   accessModes:
     - ReadWriteMany
   storageClassName: airflow-dags
   nfs:
-    server: 192.168.1.9
-    path: "/mnt/nfs_share"
+    server: 10.216.137.186
+    path: "/volume1/nfs_share"
 ---
@@ -22,13 +22,13 @@ metadata:
   namespace: bcom-airflow
 spec:
   capacity:
-    storage: 8000Mi
+    storage: 16Gi
   accessModes:
     - ReadWriteMany
   storageClassName: airflow-postgres
   nfs:
-    server: 192.168.1.9
-    path: "/mnt/nfs_postgres"
+    server: 10.216.137.186
+    path: "/volume1/nfs_postgres"
 ---
@@ -39,13 +39,13 @@ metadata:
   namespace: bcom-airflow
 spec:
   capacity:
-    storage: 4000Mi
+    storage: 10Gi
   accessModes:
     - ReadWriteMany
   storageClassName: airflow-logs
   nfs:
-    server: 192.168.1.9
-    path: "/mnt/nfs_logs"
+    server: 10.216.137.186
+    path: "/volume1/nfs_logs"
 ---
@@ -60,7 +60,7 @@ spec:
   storageClassName: airflow-dags
   resources:
     requests:
-      storage: 200Mi
+      storage: 5Gi
 ---
@@ -75,7 +75,7 @@ spec:
   storageClassName: airflow-postgres
   resources:
     requests:
-      storage: 7500Mi
+      storage: 16Gi
 ---
@@ -91,5 +91,5 @@ spec:
     - ReadWriteMany
   resources:
     requests:
-      storage: 3500Mi
+      storage: 10Gi
   storageClassName: airflow-logs
\ No newline at end of file
```
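PersistentVolume capacity is not resized in place for static NFS volumes, so these manifests are normally deleted and re-applied. A hedged way to confirm the new sizes and that every claim is Bound afterwards:

```shell
# PVs are cluster-scoped; the PVCs live in the bcom-airflow namespace.
kubectl get pv
kubectl -n bcom-airflow get pvc
```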
deploy-k8/airflow-webserver-deployment.yaml

```diff
@@ -22,6 +22,10 @@ spec:
       containers:
         - name: airflow-webserver
           image: apache/airflow:2.5.3
+          resources:
+            requests:
+              cpu: "500m"
+              memory: "500Mi"
           args: ["webserver"]
           envFrom:
             - configMapRef:
```
deploy-k8/postgres-deployment.yaml

```diff
@@ -21,8 +21,8 @@ spec:
           image: postgres:12
           resources:
             limits:
-              memory: 128Mi
-              cpu: 500m
+              memory: "2Gi"
+              cpu: "500m"
           ports:
             - containerPort: 5432
           env:
```
deploy-k8/script-apply.sh

```diff
@@ -7,4 +7,4 @@ kubectl apply -f airflow-secrets.yaml
 kubectl apply -f airflow-webserver-deployment.yaml
 kubectl apply -f airflow-webserver-service.yaml
 kubectl apply -f airflow-scheduler-deployment.yaml
-kubectl apply -f sync-dags-deployment.yaml
+kubectl apply -f sync-dags-deployment-gcs.yaml
```
deploy-k8/script-delete.sh

```diff
@@ -5,6 +5,6 @@ kubectl delete -f airflow-secrets.yaml
 kubectl delete -f airflow-webserver-service.yaml
 kubectl delete -f airflow-webserver-deployment.yaml
 kubectl delete -f airflow-scheduler-deployment.yaml
-kubectl delete -f sync-dags-deployment.yaml
+kubectl delete -f sync-dags-deployment-gcs.yaml
 kubectl delete -f airflow-volumes.yaml
 kubectl delete -f airflow-envvars-configmap.yaml
\ No newline at end of file
```
deploy-k8/sync-dags-deployment-gcs.yaml

```diff
@@ -14,9 +14,12 @@ spec:
         app: airflow-sync-dags
     spec:
+      serviceAccountName: bcom-airflow
+      nodeSelector:
+        iam.gke.io/gke-metadata-server-enabled: "true"
       containers:
         - args:
-            - while true; gcloud rsync -d -r ${GCS_DAGS_DIR:-gs://prueba-rsync/carpeta} /dags;
+            - while true; gsutil rsync -d -r ${GCS_DAGS_DIR:-gs://prueba-rsync2/carpeta} /dags;
              do sleep ${SYNCHRONYZE_DAG_DIR:-30}; done;
           command:
             - /bin/bash
@@ -24,20 +27,13 @@ spec:
             - --
           name: sync-dags-gcloud
           image: gcr.io/google.com/cloudsdktool/google-cloud-cli:alpine
+          resources:
+            limits:
+              cpu: "250m"
+              memory: "1Gi"
           envFrom:
             - configMapRef:
                 name: airflow-envvars-configmap
-          env:
-            - name: AWS_ACCESS_KEY_ID
-              valueFrom:
-                secretKeyRef:
-                  key: AWS_ACCESS_KEY
-                  name: credentials
-            - name: AWS_SECRET_ACCESS_KEY
-              valueFrom:
-                secretKeyRef:
-                  key: AWS_SECRET_KEY
-                  name: credentials
           volumeMounts:
             - name: dags-host-volume
               mountPath: /dags
```
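The sidecar's sync loop is easier to read as plain shell. This is a hedged, standalone restatement of the container args (default bucket and interval taken from the diff above), useful for testing the sync outside the cluster with local gcloud credentials:

```shell
# One-way sync of the DAGs bucket into /dags, repeated every SYNCHRONYZE_DAG_DIR seconds.
while true; do
  gsutil rsync -d -r "${GCS_DAGS_DIR:-gs://prueba-rsync2/carpeta}" /dags
  sleep "${SYNCHRONYZE_DAG_DIR:-30}"
done
```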