🧠 Titanic Survival Analysis with GODML
This analysis uses data from the famous Titanic shipwreck to predict passenger survival with classification models. The goal is to compare the performance of two algorithms, Random Forest and XGBoost, using the GODML framework for reproducible experimentation.
📥 Dataset Description
The dataset contains information about the Titanic's passengers. The most relevant columns are:
- Pclass: Ticket class (1st, 2nd, 3rd)
- Sex: Passenger's sex
- Age: Age in years
- SibSp: Number of siblings/spouses aboard
- Parch: Number of parents/children aboard
- Fare: Fare paid
- Embarked: Port of embarkation
- Survived: 0 = did not survive, 1 = survived (this is our target variable)
During preprocessing we turn this column into the target and encode the categorical variables (a pandas sketch of the encoding follows).
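The encoding itself is plain one-hot with the first level dropped; here is a minimal pandas sketch of what the GODML one_hot step later in this notebook produces (illustration only, not the framework's own code):

import pandas as pd

# drop_first=True drops one level per column: Sex -> Sex_male,
# Embarked -> Embarked_Q / Embarked_S (matching the clean dataset below)
demo = pd.DataFrame({"Sex": ["male", "female", "female"],
                     "Embarked": ["S", "C", "Q"]})
pd.get_dummies(demo, columns=["Sex", "Embarked"], drop_first=True)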
🔍 Model Comparison: Random Forest vs XGBoost
We will train two models with GODML:
- RandomForestClassifier: a tree-based ensemble model.
- XGBoostClassifier: a highly efficient and accurate boosting algorithm.
We will compare their performance using standard metrics.
# If you are inside the project repo:
# this installs GODML in editable mode so code changes are picked up on the fly.
%pip install -e .
# Useful dependencies (if you don't already have them)
%pip install scikit-learn matplotlib joblib pyyaml
# Quality of life
%load_ext autoreload
%autoreload 2
import warnings
warnings.filterwarnings("ignore")  # less visual noise
🧪 GODML with the Real Titanic Dataset
This notebook downloads, prepares, and runs a Random Forest model with GODML using the official Titanic dataset.
import pandas as pd
from pathlib import Path
from godml import notebook_api as nb

# Download the real dataset
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

# Save as CSV (create data/ first so to_csv does not fail)
Path("data").mkdir(parents=True, exist_ok=True)
df.to_csv("data/titanic.csv", index=False, encoding="utf-8")

# 👉👉 EDIT THIS PATH 👈👈
TITANIC_CSV = Path("data/titanic.csv")
assert TITANIC_CSV.exists(), f"CSV not found: {TITANIC_CSV}"
df_raw = pd.read_csv(TITANIC_CSV)
print("Shape:", df_raw.shape)
df_raw.head(3)
Shape: (891, 12)
   PassengerId  Survived  Pclass                                               Name     Sex   Age  SibSp  Parch            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris    male  22.0      1      0         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley Florence Briggs Th...  female  38.0      1      0          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  female  26.0      0      0  STON/O2. 3101282   7.9250   NaN        S
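Before preparing the data it helps to see what is actually missing; a quick check (in the standard Titanic CSV, Cabin and Age carry most of the nulls, plus a couple of missing Embarked values):

# Missing values per column, to motivate the imputation step below
df_raw.isna().sum().sort_values(ascending=False).head(5)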
🧪 Cell 2: Inline DataPrep (casting + imputation + encoding)
# GODML DataPrep only (compat mode): no strip_strings op, so we clean with pandas first
from pathlib import Path
import pandas as pd
from godml import notebook_api as nb

# 0) df_raw already exists
df_local = df_raw.copy()

# 0.1) Normalize the target => 'survived'
cols_norm = {c: c.strip() for c in df_local.columns}
df_local = df_local.rename(columns=cols_norm)
if "Survived" in df_local.columns and "survived" not in df_local.columns:
    df_local = df_local.rename(columns={"Survived": "survived"})

# 0.2) Trim strings with pandas (replaces strip_strings)
for c in ["Sex", "Embarked"]:
    if c in df_local.columns:
        df_local[c] = df_local[c].astype(str).str.strip()

# 0.3) Persist input / output
data = Path(".")
inp = (data / "titanic_input.csv").as_posix()
out = (data / "titanic_clean_godml.csv").as_posix()
df_local.to_csv(inp, index=False)

# 1) GODML recipe without strip_strings
recipe = {
    "inputs": [{"name": "raw", "connector": "csv", "uri": inp}],
    "steps": [
        {"op": "select", "params": {"columns": [
            "survived" if "survived" in df_local.columns else "Survived",
            "Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"
        ]}},
        {"op": "safe_cast", "params": {"mapping": {
            "Age": "float", "Fare": "float",
            "Pclass": "int", "SibSp": "int", "Parch": "int"
        }}},
        {"op": "fillna", "params": {"columns": {"Age": 28.0, "Embarked": "S"}}},
        {"op": "fillna", "params": {"columns": {"Fare": 14.45}}},
        {"op": "one_hot", "params": {"columns": ["Sex", "Embarked"], "drop_first": True}},
        {"op": "rename", "params": {"mapping": {"Survived": "survived"}}},
        {"op": "safe_cast", "params": {"mapping": {"survived": "int"}}},
        {"op": "drop_duplicates", "params": {}},
    ],
    "outputs": [{"name": "clean", "connector": "csv", "uri": out}],
}

df_clean = nb.dataprep_run_inline(recipe)
print("✅ DataPrep OK — shape:", df_clean.shape)
nb.summarize_df(df_clean)
[OPENLINEAGE] 2025-08-12T23:20:13.885362 READ id=1c7bba31-0514-4366-a701-6daa56e05dba payload={'dataset': 'titanic_input.csv', 'connector': 'csv'}
[OPENLINEAGE] 2025-08-12T23:20:13.889553 TRANSFORM id=ceb5f4e1-4702-4d49-8e8e-b658fc6faa04 payload={'op': 'select', 'params': {'columns': ['survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']}}
[OPENLINEAGE] 2025-08-12T23:20:13.889553 TRANSFORM id=b67bc858-c0a4-40f9-8e88-ce084ca546ac payload={'op': 'safe_cast', 'params': {'mapping': {'Age': 'float', 'Fare': 'float', 'Pclass': 'int', 'SibSp': 'int', 'Parch': 'int'}}}
[OPENLINEAGE] 2025-08-12T23:20:13.891564 TRANSFORM id=e63acc54-d098-439e-a522-34150000273a payload={'op': 'fillna', 'params': {'columns': {'Age': 28.0, 'Embarked': 'S'}}}
[OPENLINEAGE] 2025-08-12T23:20:13.891564 TRANSFORM id=a489327c-8367-4b50-b3df-81a8794bc652 payload={'op': 'fillna', 'params': {'columns': {'Fare': 14.45}}}
[OPENLINEAGE] 2025-08-12T23:20:13.895587 TRANSFORM id=c3ac68ac-9fea-41db-bc41-1dea8cd8e50b payload={'op': 'one_hot', 'params': {'columns': ['Sex', 'Embarked'], 'drop_first': True}}
[OPENLINEAGE] 2025-08-12T23:20:13.895587 TRANSFORM id=344ae4d3-dcea-4fa4-8753-9684a44f3e4a payload={'op': 'rename', 'params': {'mapping': {'Survived': 'survived'}}}
[OPENLINEAGE] 2025-08-12T23:20:13.895587 TRANSFORM id=5cd28351-847c-4011-833d-4785f2c97aaf payload={'op': 'safe_cast', 'params': {'mapping': {'survived': 'int'}}}
[OPENLINEAGE] 2025-08-12T23:20:13.897595 TRANSFORM id=042720ea-4144-48c9-ae1e-4db2ed56ba3f payload={'op': 'drop_duplicates', 'params': {}}
[OPENLINEAGE] 2025-08-12T23:20:13.897595 WRITE id=bee4a712-da53-4f07-b49c-ff912b9ad4e1 payload={'dataset': 'titanic_clean_godml.csv', 'connector': 'csv'}
✅ DataPrep OK — shape: (775, 9)
{'shape': [775, 9],
'nulls': {'survived': 0,
'Pclass': 0,
'Age': 0,
'SibSp': 0,
'Parch': 0,
'Fare': 0,
'Sex_male': 0,
'Embarked_Q': 0,
'Embarked_S': 0},
'dtypes': {'survived': 'Int64',
'Pclass': 'Int64',
'Age': 'float64',
'SibSp': 'Int64',
'Parch': 'Int64',
'Fare': 'float64',
'Sex_male': 'bool',
'Embarked_Q': 'bool',
'Embarked_S': 'bool'},
'unique': {'survived': 2,
'Pclass': 3,
'Age': 88,
'SibSp': 7,
'Parch': 7,
'Fare': 248,
'Sex_male': 2,
'Embarked_Q': 2,
'Embarked_S': 2}}
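The fill values hard-coded in the recipe are not arbitrary: 28.0 and 14.45 appear to be the median Age and the (rounded) median Fare of the raw data, and 'S' its most frequent port. A quick check:

# Verify the recipe's imputation constants against the raw data
print(df_raw["Age"].median())        # 28.0
print(df_raw["Fare"].median())       # 14.4542, rounded to 14.45 in the recipe
print(df_raw["Embarked"].mode()[0])  # 'S'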
🔒 Compliance demo (PCI-DSS)
from godml.notebook_api import GodmlNotebook

nbk = GodmlNotebook()
model_type = "random_forest"
hyperparams = {"max_depth": 5, "n_estimators": 100}
dataset_path = "./data/titanic_clean_godml.csv"  # make sure this file exists
pipeline = nbk.create_pipeline(
    name="titanic_rf_model",
    model_type=model_type,
    hyperparameters=hyperparams,
    dataset_path=dataset_path
)
pipeline  # inspect the object
WARNING:tensorflow:From C:\Users\arturo\Documents\dev_godml\venv\Lib\site-packages\keras\src\losses.py:2976: The name tf.losses.sparse_softmax_cross_entropy is deprecated. Please use tf.compat.v1.losses.sparse_softmax_cross_entropy instead.
PipelineDefinition(name='titanic_rf_model', version='1.0.0', provider='mlflow', description=None, dataset=DatasetConfig(uri='./data/titanic_clean_godml.csv', hash='auto', target=None, dataprep=None), model=ModelConfig(type='random_forest', source='core', hyperparameters=Hyperparameters(max_depth=5, eta=None, objective=None, n_estimators=100, max_features=None, random_state=None)), metrics=[Metric(name='auc', threshold=0.8)], governance=Governance(owner='notebook-user@company.com', compliance=None, tags=[{'source': 'jupyter'}]), deploy=DeployConfig(realtime=False, batch_output='./outputs/titanic_rf_model_predictions.csv'))
🤖 Train / Predict / Evaluate (RandomForest)
# GODML Train → Predict → Evaluate
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from godml import notebook_api as nb

df = pd.read_csv("./data/titanic_clean.csv")
target_col = "survived"
assert target_col in df.columns, "Target 'survived' is missing from the clean dataset. Check the recipe."

# GODML expects numeric features for training (the recipe already turned Sex_/Embarked_ into dummies)
X_all = df.drop(columns=[target_col]).select_dtypes(include=[np.number])
y_all = df[target_col].astype(int)

# Extra validation: stop with a clear message if any non-numeric column remains
non_numeric = [c for c in df.drop(columns=[target_col]).columns if df[c].dtype == "object"]
assert not non_numeric, f"Non-numeric columns found: {non_numeric}. Adjust the recipe (select/one_hot)."

# Safe stratification: only when every class has ≥2 examples
do_stratify = y_all.nunique() <= 20 and y_all.value_counts().min() >= 2
strata = y_all if do_stratify else None
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.25, random_state=42, stratify=strata
)

# GODML: train_model is the helper already adapted for wrappers (fit/train/estimator)
res = nb.train_model(
    "random_forest",
    X_train, y_train,
    hyperparams={"n_estimators": 200, "max_depth": 6}
)

# GODML: predict/evaluate helpers
y_pred = nb.predict(res, X_test)
metrics = nb.evaluate(y_test, y_pred, ["accuracy", "precision", "recall", "f1", "roc_auc"])
metrics
{'accuracy': 0.6237113402061856,
'precision': 0.5411764705882353,
'recall': 0.575,
'f1': 0.5575757575757576,
'roc_auc': 0.6989035087719299}
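These numbers are noticeably lower than the quick_train run below. One likely culprit: select_dtypes(include=[np.number]) does not match bool columns, so the Sex_male/Embarked_Q/Embarked_S dummies (reported as bool in the summary above) were silently dropped from X_all. A minimal sketch that keeps them by casting to int first:

# Cast bool dummy columns to int so the numeric filter keeps them
features = df.drop(columns=[target_col]).copy()
bool_cols = features.select_dtypes(include=["bool"]).columns
features[bool_cols] = features[bool_cols].astype(int)
X_all_fixed = features.select_dtypes(include=[np.number])
print(X_all_fixed.columns.tolist())  # should now include Sex_male, Embarked_Q, Embarked_S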
🚀 Training a model with MLflow: xgboost-quick-train
from godml.notebook_api import quick_train

quick_train(
    model_type="xgboost",
    hyperparameters={"eta": 0.1, "max_depth": 4},
    dataset_path="./data/titanic_clean.csv"
)
🚀 Training model with MLflow: xgboost-quick-train
ℹ️ CWD = C:\Users\arturo\Documents\dev_godml\
ℹ️ dataset.uri = ./data/titanic_clean.csv (abs: C:\data\titanic_clean.csv)
ℹ️ dataset.target = None
ℹ️ dataset.dataprep present? -> False
📥 Loading clean dataset from path: C:\data\titanic_clean.csv
✅ Model xgboost loaded from core
📊 Metrics:
  - auc: 0.8621
  - accuracy: 0.8194
  - precision: 0.8103
  - recall: 0.7344
  - f1: 0.7705
✅ Training finished. AUC: 0.8621
✅ Model registered successfully: xgboost-quick-train-xgboost
📦 Predictions saved to: C:\outputs\xgboost-quick-train_predictions.csv
Registered model 'xgboost-quick-train-xgboost' already exists. Creating a new version of this model...
Created version '6' of model 'xgboost-quick-train-xgboost'.
'✅ Model trained successfully'
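Note in the log that the relative uri ./data/titanic_clean.csv resolved to C:\data\titanic_clean.csv, i.e. against the drive root rather than the project folder. Passing an absolute path avoids that ambiguity (a sketch; it assumes quick_train accepts any path string):

from pathlib import Path
from godml.notebook_api import quick_train

# Resolve the dataset against the notebook's working directory
dataset = Path("data/titanic_clean.csv").resolve()
quick_train(
    model_type="xgboost",
    hyperparameters={"eta": 0.1, "max_depth": 4},
    dataset_path=str(dataset),
)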
import numpy as np

# If predict returned hard classes, ROC/PR curves built from them are coarse (not ideal).
y_pred_arr = np.asarray(y_pred)
if set(np.unique(y_pred_arr)) <= {0, 1}:
    print("Note: predict returned classes. Trying predict_proba via the wrapper if available.")
    # nb.predict already tries predict_proba when it exists, so we reuse y_pred directly.
    # If your wrapper returned classes, adjust the nb.predict call to get probabilities (model-specific).

nb.plot_roc_pr_curves(y_test, y_pred_arr if y_pred_arr.ndim == 1 else y_pred_arr[:, 1])
(Output: two figures, the ROC curve and the precision-recall curve.)
results = []

# RF
rf = nb.train_model("random_forest", X_train, y_train, {"n_estimators": 300, "max_depth": 6})
rf_pred = nb.predict(rf, X_test)
rf_metrics = nb.evaluate(y_test, rf_pred, ["accuracy", "roc_auc"])
setattr(rf, "metrics", rf_metrics)
results.append(rf)

# XGBoost (optional)
try:
    xgb = nb.train_model("xgboost", X_train, y_train, {"eta": 0.3, "max_depth": 4})
    xgb_pred = nb.predict(xgb, X_test)
    xgb_metrics = nb.evaluate(y_test, xgb_pred, ["accuracy", "roc_auc"])
    setattr(xgb, "metrics", xgb_metrics)
    results.append(xgb)
except Exception as e:
    print("XGBoost unavailable or not configured. Skipping. Reason:", e)

nb.compare_models(results, by="roc_auc")
               model  accuracy   roc_auc
0       XgboostModel  0.680412  0.722478
1  RandomForestModel  0.623711  0.698904
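For reference, the table above boils down to collecting the metrics dict we attached to each result; here is a hypothetical pandas reimplementation (not GODML's actual compare_models):

import pandas as pd

# Each result carries the .metrics dict we set above; getattr handles
# wrappers that expose the underlying model via a .model attribute
rows = [{"model": type(getattr(m, "model", m)).__name__, **m.metrics} for m in results]
pd.DataFrame(rows).sort_values("roc_auc", ascending=False).reset_index(drop=True)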
📌 1. Saving and reloading the trained artifact
from pathlib import Path
path = Path("./titanic_rf_artifact.joblib")
nb.save_artifact({"model": rf.model, "metrics": rf_metrics}, path)
loaded = nb.load_artifact(path)
loaded.keys(), type(loaded["model"]).__name__
(dict_keys(['model', 'metrics']), 'RandomForestModel')
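Given the .joblib extension, the artifact can presumably also be opened without GODML; this assumes save_artifact writes a standard joblib pickle, which is not documented behavior:

import joblib

# Assumption: save_artifact stores a plain joblib pickle
artifact = joblib.load("titanic_rf_artifact.joblib")
print(artifact["metrics"])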
📌 2. End-to-end pipeline run with GodmlNotebook
from godml.notebook_api import GodmlNotebook

g = GodmlNotebook()
g.create_pipeline(
    name="titanic-demo",
    model_type="xgboost",  # or "random_forest" if you prefer
    hyperparameters={"eta": 0.3, "max_depth": 3},
    dataset_path="./data/titanic_clean.csv",  # the already-clean CSV (or the original)
)
g.train()
🚀 Training model with MLflow: titanic-demo
ℹ️ CWD = C:\Users\arturo\Documents\dev_godml\
ℹ️ dataset.uri = ./data/titanic_clean.csv
ℹ️ dataset.target = None
ℹ️ dataset.dataprep present? -> False
📥 Loading clean dataset from path: \data\titanic_clean.csv
✅ Model xgboost loaded from core
📊 Metrics:
  - auc: 0.8625
  - accuracy: 0.8258
  - precision: 0.8246
  - recall: 0.7344
  - f1: 0.7769
✅ Training finished. AUC: 0.8625
✅ Model registered successfully: titanic-demo-xgboost
📦 Predictions saved to: \outputs\titanic-demo_predictions.csv
Successfully registered model 'titanic-demo-xgboost'.
Created version '1' of model 'titanic-demo-xgboost'.
'✅ Training completed'
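Since the run registered titanic-demo-xgboost version 1 in the MLflow model registry, it can in principle be loaded back with the standard MLflow API; this sketch assumes the model was logged with a pyfunc flavor and that the same local tracking URI is active:

import mlflow.pyfunc

# Load version 1 from the registry (expects the same feature columns used in training)
model = mlflow.pyfunc.load_model("models:/titanic-demo-xgboost/1")
preds = model.predict(X_test)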
# Final sanity check: summarize the clean dataset used throughout
summary = nb.summarize_df(df)
summary
{'shape': [775, 9],
'nulls': {'survived': 0,
'Pclass': 0,
'Age': 0,
'SibSp': 0,
'Parch': 0,
'Fare': 0,
'Sex_male': 0,
'Embarked_Q': 0,
'Embarked_S': 0},
'dtypes': {'survived': 'int64',
'Pclass': 'int64',
'Age': 'float64',
'SibSp': 'int64',
'Parch': 'int64',
'Fare': 'float64',
'Sex_male': 'bool',
'Embarked_Q': 'bool',
'Embarked_S': 'bool'},
'unique': {'survived': 2,
'Pclass': 3,
'Age': 88,
'SibSp': 7,
'Parch': 7,
'Fare': 248,
'Sex_male': 2,
'Embarked_Q': 2,
'Embarked_S': 2}}