In today's data-driven world, mobile applications are increasingly expected to do more than just display information; they need to be intelligent. Imagine a React Native app that doesn't just allow users to upload documents but can also understand, categorize, and process them automatically. This is precisely what we set out to achieve by combining the mobile prowess of React Native with the backend strength of Python and Machine Learning for intelligent PDF classification. This post chronicles our journey, detailing the architecture, implementation challenges, and the powerful synergy between these technologies.
The Vision: Smart Document Handling on Mobile
The core challenge was to move beyond simple document storage. We wanted our React Native application to:
- Allow users to seamlessly upload PDF documents from their mobile devices.
- Intelligently classify these PDFs into predefined categories (e.g., invoices, contracts, reports).
- Provide users with quick insights and enable automated downstream workflows based on the document type.
To realize this, we knew a robust backend capable of handling complex PDF processing and machine learning inference was essential. Python, with its rich ML ecosystem and powerful web frameworks, was the natural choice.
The Blueprint: System Architecture Overview
Our system is designed as a set of collaborating services, ensuring scalability and maintainability; a sketch of how these pieces map onto a project layout follows the list:
- React Native Frontend:
- Provides the user interface for document uploads.
- Manages interactions, displays processing status, and visualizes classification results.
- Communicates with the backend API.
- Python Backend (FastAPI Microservice):
- The workhorse for our intelligent processing. Built with FastAPI for its high performance, asynchronous capabilities, and ease of use.
- Handles incoming PDF uploads from the React Native app.
- Orchestrates the PDF processing pipeline (text extraction, preprocessing).
- Integrates with the Machine Learning model for document classification.
- Manages asynchronous tasks for potentially long-running operations.
- Machine Learning Model (Python):
- The "brain" of the system. A pre-trained model (e.g., using Scikit-learn, TensorFlow, or PyTorch) responsible for classifying documents based on their textual content.
- Involves components for text feature extraction (like TF-IDF or word embeddings) and the classification algorithm itself.
- Supporting Backend Infrastructure:
- API Gateway (Optional but Recommended): A single entry point for all client requests, handling routing, authentication, and rate limiting (e.g., Nginx, AWS API Gateway).
- Task Queue (Celery & Redis/RabbitMQ): For offloading time-consuming tasks like PDF parsing and ML inference, ensuring the API remains responsive.
- Storage Solution (e.g., AWS S3, MinIO): For storing uploaded PDFs and potentially their processed versions or metadata.
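To make the moving parts concrete, here is one possible backend layout. This is purely illustrative: the file names match the modules shown later in this post, but your structure may differ.
# Illustrative project layout (file names match the snippets below)
#
# backend/
#   main.py           - FastAPI app and endpoints
#   ml_processor.py   - DocumentClassifier + MLProcessingError
#   pdf_processor.py  - PDFExtractor + PDFExtractionError
#   tasks.py          - Celery tasks for asynchronous processing
# mobile/
#   screens/DocumentProcessorScreen.js  - upload UI and result display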
Digging into Implementation Details
Let's explore the core code components that bring this system to life.
1. The Python Backend Service with FastAPI
FastAPI allows us to build efficient, asynchronous API endpoints.
# main.py (FastAPI application)
from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.middleware.cors import CORSMiddleware # For local development
import logging # For better logging
# Assuming these modules are in your project structure
from .ml_processor import DocumentClassifier, MLProcessingError
from .pdf_processor import PDFExtractor, PDFExtractionError
# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(title="Intelligent PDF Processor")
# CORS for local development (adjust origins for production)
app.add_middleware(
CORSMiddleware,
allow_origins=["*"], # Restrict this in production!
allow_credentials=True,
allow_methods=["POST", "GET"],
allow_headers=["*"],
)
# Initialize our processors once (singleton pattern for efficiency)
pdf_extractor = PDFExtractor()
document_classifier = DocumentClassifier() # Loads ML model on init
@app.post("/api/v1/classify-document/")
async def classify_pdf_document(file: UploadFile = File(...)):
"""
Receives a PDF file, extracts text, classifies it,
and returns the classification and confidence.
"""
    if not file.filename or not file.filename.lower().endswith(".pdf"):
raise HTTPException(status_code=400, detail="Invalid file type. Only PDFs are accepted.")
try:
logger.info(f"Processing file: {file.filename}")
# Extract text from the PDF
# UploadFile.read() is an async operation
pdf_content_bytes = await file.read()
extracted_text = await pdf_extractor.extract_text_from_bytes(pdf_content_bytes)
if not extracted_text or extracted_text.strip() == "":
logger.warning(f"No text could be extracted from {file.filename}")
raise HTTPException(status_code=422, detail="Could not extract text from PDF.")
# Classify the extracted text
classification_result = await document_classifier.classify(extracted_text)
logger.info(f"File {file.filename} classified as: {classification_result['category']} with confidence: {classification_result['confidence']:.2f}")
return {
"filename": file.filename,
"classification": classification_result["category"],
"confidence": classification_result["confidence"]
}
except PDFExtractionError as e:
logger.error(f"PDF Extraction Error for {file.filename}: {e}")
raise HTTPException(status_code=500, detail=f"Error extracting text from PDF: {e}")
except MLProcessingError as e:
logger.error(f"ML Processing Error for {file.filename}: {e}")
raise HTTPException(status_code=500, detail=f"Error during document classification: {e}")
    except HTTPException:
        # Propagate HTTP errors (like the 422 above) untouched,
        # rather than letting the broad handler convert them to 500s
        raise
    except Exception as e:
logger.error(f"Unexpected error processing {file.filename}: {e}", exc_info=True)
raise HTTPException(status_code=500, detail="An unexpected server error occurred.")
@app.get("/health")
async def health_check():
return {"status": "healthy", "message": "PDF Classification Service is Up!"}
- We use `UploadFile` for efficient handling of file uploads.
- The endpoint is `async`, allowing concurrent request handling.
- Error handling is crucial for robustness.
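With the service running (for example via uvicorn), a quick smoke test from Python confirms the endpoint behaves as expected. This is a minimal sketch: the base URL and the sample file name are assumptions, so adjust them to your setup.
# smoke_test.py - quick manual test of the classify endpoint
# Assumes the API is reachable at http://localhost:8000 and that
# sample_invoice.pdf exists locally; both are placeholders.
import requests

def classify_pdf(path: str, url: str = "http://localhost:8000/api/v1/classify-document/") -> dict:
    with open(path, "rb") as f:
        # The field name "file" must match the FastAPI parameter name
        response = requests.post(url, files={"file": (path, f, "application/pdf")}, timeout=60)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    result = classify_pdf("sample_invoice.pdf")
    print(f"{result['filename']}: {result['classification']} ({result['confidence']:.2%})")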
2. The Machine Learning Model Implementation
This class encapsulates our ML model logic.
# ml_processor.py
import joblib # Or pickle, for loading scikit-learn models
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.ensemble import RandomForestClassifier # Example classifier
import numpy as np
import os
MODEL_PATH = os.getenv("MODEL_PATH", "path/to/your/trained_model.joblib")
VECTORIZER_PATH = os.getenv("VECTORIZER_PATH", "path/to/your/vectorizer.joblib")
class MLProcessingError(Exception):
"""Custom exception for ML processing errors."""
pass
class DocumentClassifier:
def __init__(self):
try:
self.model = joblib.load(MODEL_PATH)
self.vectorizer: TfidfVectorizer = joblib.load(VECTORIZER_PATH) # Type hint for clarity
if not hasattr(self.model, 'predict_proba'):
raise MLProcessingError("Model does not support probability estimates (predict_proba).")
except FileNotFoundError:
raise MLProcessingError(f"ML model or vectorizer not found at specified paths. Ensure MODEL_PATH and VECTORIZER_PATH are set.")
except Exception as e:
raise MLProcessingError(f"Error loading ML model or vectorizer: {e}")
async def classify(self, text: str) -> dict:
if not text or not isinstance(text, str):
raise ValueError("Input text cannot be empty or non-string.")
try:
# Preprocess text (ensure vectorizer is fitted on similar preprocessed text)
# This example assumes the vectorizer handles necessary preprocessing.
# In a real scenario, you might have a separate preprocessing function.
text_features = self.vectorizer.transform([text])
# Get prediction and confidence
prediction = self.model.predict(text_features)
probabilities = self.model.predict_proba(text_features)
# Get the highest probability and its corresponding class index
confidence_score = float(np.max(probabilities[0]))
predicted_class = prediction[0] # Assuming single prediction
return {"category": str(predicted_class), "confidence": confidence_score}
except Exception as e:
# Log the error for debugging
# logger.error(f"Error during classification: {e}", exc_info=True)
raise MLProcessingError(f"Failed to classify text: {e}")
# Placeholder for actual model loading logic if not using joblib/pickle
# def load_model(self): pass
# def load_vectorizer(self): pass
- We load a pre-trained model and vectorizer (e.g., TF-IDF) during initialization.
- The `classify` method takes text, vectorizes it, and uses the model to predict the category and confidence score.
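As a quick sanity check, the classifier can be exercised directly, outside FastAPI. A minimal sketch, assuming MODEL_PATH and VECTORIZER_PATH point at real artifacts and the script runs from the backend package root:
# classifier_check.py - exercise DocumentClassifier outside FastAPI
# Assumes trained artifacts exist at MODEL_PATH / VECTORIZER_PATH.
import asyncio
from ml_processor import DocumentClassifier

async def main():
    classifier = DocumentClassifier()
    result = await classifier.classify("Invoice #1042: total amount due $1,250.00")
    print(result)  # e.g. {'category': 'invoice', 'confidence': 0.91} (illustrative)

asyncio.run(main())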
3. React Native Integration: Uploading and Displaying Results
On the mobile side, we need to handle file picking, uploading, and displaying the results.
// screens/DocumentProcessorScreen.js
import React, { useState } from 'react';
import { View, Button, Text, ActivityIndicator, StyleSheet, Alert } from 'react-native';
import DocumentPicker from 'react-native-document-picker';
// Assume you have an API service configured (e.g., using Axios)
import apiClient from '../services/apiClient'; // Your API client
const styles = StyleSheet.create({
container: { flex: 1, alignItems: 'center', justifyContent: 'center', padding: 20 },
statusText: { marginVertical: 10, fontSize: 16 },
resultContainer: { marginTop: 20, padding: 15, borderWidth: 1, borderColor: '#ccc', borderRadius: 5 },
filenameText: { fontWeight: 'bold' },
// Add more styles as needed
});
const DocumentProcessorScreen = () => {
const [isProcessing, setIsProcessing] = useState(false);
const [classificationResult, setClassificationResult] = useState(null);
const [selectedFile, setSelectedFile] = useState(null);
const handleDocumentPick = async () => {
try {
const pickerResult = await DocumentPicker.pickSingle({
type: [DocumentPicker.types.pdf],
});
setSelectedFile(pickerResult);
setClassificationResult(null); // Clear previous results
// Automatically process after picking, or add a separate button
await processDocument(pickerResult);
} catch (err) {
if (DocumentPicker.isCancel(err)) {
console.log('User cancelled the document picker');
} else {
Alert.alert('Error', 'Could not pick the document.');
console.error('DocumentPicker Error: ', err);
}
}
};
const processDocument = async (fileToProcess) => {
if (!fileToProcess) {
Alert.alert('No file selected', 'Please pick a document first.');
return;
}
setIsProcessing(true);
setClassificationResult(null);
const formData = new FormData();
formData.append('file', {
uri: fileToProcess.uri,
type: fileToProcess.type, // e.g., 'application/pdf'
name: fileToProcess.name,
});
try {
// Replace '/api/v1/classify-document/' with your actual endpoint
const response = await apiClient.post('/api/v1/classify-document/', formData, {
headers: {
'Content-Type': 'multipart/form-data',
},
// Optional: Add a timeout for the request
// timeout: 30000, // 30 seconds
});
setClassificationResult(response.data);
Alert.alert('Success', `Document classified as: ${response.data.classification}`);
} catch (error) {
console.error('Processing failed:', error.response?.data || error.message);
Alert.alert('Processing Failed', error.response?.data?.detail || 'Could not connect to the server.');
} finally {
setIsProcessing(false);
}
};
return (
<View style={styles.container}>
<Button title="Pick & Classify PDF Document" onPress={handleDocumentPick} disabled={isProcessing} />
{selectedFile && !isProcessing && (
<Text style={styles.statusText}>Selected: {selectedFile.name}</Text>
)}
{isProcessing && (
<>
<ActivityIndicator size="large" color="#0000ff" style={{ marginVertical: 20 }} />
<Text style={styles.statusText}>Processing: {selectedFile?.name || 'document'}...</Text>
</>
)}
{classificationResult && !isProcessing && (
<View style={styles.resultContainer}>
<Text style={styles.filenameText}>File: {classificationResult.filename}</Text>
<Text>Classification: {classificationResult.classification}</Text>
<Text>Confidence: {(classificationResult.confidence * 100).toFixed(2)}%</Text>
</View>
)}
</View>
);
};
export default DocumentProcessorScreen;
- We use `react-native-document-picker` to allow users to select a PDF.
- A `FormData` object is used to send the file to the backend.
- Loading states and result display provide user feedback.
Crafting the ML Pipeline: From PDF to Prediction
1. Text Extraction from PDFs
The first step is to get usable text out of the PDF.
# pdf_processor.py
from PyPDF2 import PdfReader # A common library for PDF reading
# For more complex PDFs, consider pdfminer.six or Apache Tika (via a subprocess)
import io
import re
class PDFExtractionError(Exception):
"""Custom exception for PDF extraction errors."""
pass
class PDFExtractor:
async def extract_text_from_bytes(self, pdf_bytes: bytes) -> str:
try:
pdf_file = io.BytesIO(pdf_bytes)
reader = PdfReader(pdf_file)
            # Extract each page's text once, skipping pages that yield nothing
            text_parts = [text for page in reader.pages if (text := page.extract_text())]
full_text = " ".join(text_parts)
return await self._preprocess_extracted_text(full_text)
except Exception as e:
# logger.error(f"Error extracting text from PDF bytes: {e}", exc_info=True)
raise PDFExtractionError(f"Failed to read or extract text from PDF: {e}")
async def _preprocess_extracted_text(self, text: str) -> str:
if not text:
return ""
# Remove excessive newlines and whitespace
text = re.sub(r'\s+', ' ', text).strip()
# Remove special characters (customize based on your needs)
# text = re.sub(r'[^a-zA-Z0-9\s.,!?]', '', text)
# Convert to lowercase
text = text.lower()
# Add more text cleaning rules as necessary (e.g., stemming, lemmatization if your model expects it)
return text
- This uses `PyPDF2` (now maintained as `pypdf`). For scanned PDFs (images), you'd need an OCR step (e.g., Tesseract OCR via `pytesseract`).
- Basic text preprocessing (lowercase, remove extra spaces) is essential.
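For completeness, here is a minimal OCR fallback sketch for scanned PDFs. It assumes the pdf2image and pytesseract Python packages plus the Poppler and Tesseract system binaries are installed; it is not wired into the pipeline above.
# ocr_fallback.py - OCR sketch for scanned (image-only) PDFs
# Requires: pip install pdf2image pytesseract, plus the Poppler and
# Tesseract binaries on the host. Not part of the pipeline above.
import pytesseract
from pdf2image import convert_from_bytes

def ocr_pdf_bytes(pdf_bytes: bytes, dpi: int = 200) -> str:
    """Render each PDF page to an image and run Tesseract over it."""
    pages = convert_from_bytes(pdf_bytes, dpi=dpi)
    return " ".join(pytesseract.image_to_string(page) for page in pages)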
2. Model Training (A Conceptual Overview)
The actual model training is an offline process:
# conceptual_model_training.py
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier # Example
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
import joblib
import pandas as pd # Assuming data is in a CSV or similar
def train_and_save_classifier(
    data_path="path/to/your/labeled_data.csv",
    text_column="text",
    label_column="category",
    model_path="trained_document_classifier.joblib",
    vectorizer_path="tfidf_vectorizer.joblib",
):
"""
    Loads data, trains a text classifier, and saves the fitted vectorizer and model.
"""
df = pd.read_csv(data_path)
# Ensure no missing values in text or labels
df.dropna(subset=[text_column, label_column], inplace=True)
X_train, X_test, y_train, y_test = train_test_split(
df[text_column], df[label_column], test_size=0.2, random_state=42, stratify=df[label_column]
)
# Create a pipeline: TF-IDF Vectorizer -> Classifier
# Hyperparameters should be tuned using GridSearchCV or similar
text_clf_pipeline = Pipeline([
('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=0.95, min_df=2)),
('clf', RandomForestClassifier(n_estimators=100, random_state=42, class_weight='balanced'))
])
print("Training model...")
text_clf_pipeline.fit(X_train, y_train)
print("Evaluating model...")
y_pred = text_clf_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))
    # Save the fitted vectorizer and classifier separately,
    # matching how DocumentClassifier loads them at startup
    joblib.dump(text_clf_pipeline.named_steps['tfidf'], vectorizer_path)
    joblib.dump(text_clf_pipeline.named_steps['clf'], model_path)
    print(f"Vectorizer saved to {vectorizer_path}")
    print(f"Model saved to {model_path}")
# if __name__ == "__main__":
#     train_and_save_classifier()
- This involves loading labeled documents, vectorizing the text (e.g., TF-IDF), and training a classifier (e.g., RandomForest, SVM, or a neural network).
- The trained model and vectorizer are saved to be loaded by the FastAPI service.
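Once training finishes, it's worth verifying that the saved artifacts load cleanly and produce sensible output before wiring them into the service. A small sketch, with file names assumed to match the defaults above:
# verify_artifacts.py - confirm the saved vectorizer and model load
# and behave as DocumentClassifier expects. Paths are examples.
import joblib
import numpy as np

vectorizer = joblib.load("tfidf_vectorizer.joblib")
model = joblib.load("trained_document_classifier.joblib")

features = vectorizer.transform(["invoice number 1042, total due 1250.00"])
probs = model.predict_proba(features)
print("Predicted:", model.predict(features)[0], "| confidence:", float(np.max(probs[0])))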
Scaling Up: Microservices & Asynchronous Processing
For production, especially with potentially long PDF processing times, a more robust architecture is needed.
1. API Gateway (Conceptual)
An API Gateway (like Nginx, Kong, or cloud provider services) would sit in front of your FastAPI service(s) to handle:
- Request routing
- Authentication & Authorization
- Rate limiting
- SSL termination
2. Asynchronous Task Queue with Celery & Redis
Long-running tasks should not block your API.
# tasks.py (Celery tasks)
from celery import Celery
import logging

logger = logging.getLogger(__name__)  # The task below logs; define a logger here
# Assuming your FastAPI app and processors are importable
# from .main import pdf_extractor, document_classifier # Adjust import path as needed
# Configure Celery: Replace with your Redis/RabbitMQ URL
# The broker URL points to your message broker (Redis in this case).
# The backend URL is where Celery stores task results (also Redis here).
celery_app = Celery('pdf_tasks', broker='redis://localhost:6379/0', backend='redis://localhost:6379/0')
# It's better to initialize these within the task if they are not picklable
# or if they manage resources that shouldn't be shared across processes.
# For this example, assuming they can be globally initialized or re-initialized in task.
# Note: passing raw file bytes through the broker requires a serializer that
# supports them; in production, prefer uploading the file to storage (e.g., S3)
# and passing a reference to the task.
@celery_app.task(name="classify_pdf_task")
def classify_pdf_task(file_content_bytes: bytes, filename: str) -> dict:
"""
Celery task to process and classify a PDF.
This task runs asynchronously in a Celery worker.
"""
# Re-initialize or ensure processors are safe to use here
# This is important for stateful objects or those with external connections
from main import pdf_extractor, document_classifier # Example of re-importing/re-init
logger.info(f"[Celery Task] Processing file: {filename}")
try:
extracted_text = pdf_extractor.extract_text_from_bytes_sync(file_content_bytes) # Sync version for Celery
if not extracted_text or extracted_text.strip() == "":
logger.warning(f"[Celery Task] No text extracted from {filename}")
return {"error": "Could not extract text from PDF."}
classification_result = document_classifier.classify_sync(extracted_text) # Sync version for Celery
logger.info(f"[Celery Task] File {filename} classified as: {classification_result['category']}")
# Here you would typically store the result in a database associated with a task ID
# and notify the client (e.g., via WebSockets, SSE, or polling).
return {
"filename": filename,
"classification": classification_result["category"],
"confidence": classification_result["confidence"],
"status": "completed"
}
except Exception as e:
logger.error(f"[Celery Task] Error processing {filename}: {e}", exc_info=True)
# Store error information
return {"error": str(e), "status": "failed"}
# In your FastAPI endpoint, instead of direct processing:
# from .tasks import classify_pdf_task
# @app.post("/api/v1/classify-document-async/")
# async def classify_pdf_document_async(file: UploadFile = File(...)):
# file_content_bytes = await file.read()
# task = classify_pdf_task.delay(file_content_bytes, file.filename) # Use .delay() or .apply_async()
# return {"task_id": task.id, "status": "processing_queued"}
- The FastAPI endpoint would submit a job to the Celery queue and immediately return a task ID.
- React Native app would then poll for the result using the task ID or receive real-time updates via WebSockets/Server-Sent Events.
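For the polling route, a minimal status endpoint sketch is shown below. It assumes celery_app is importable from tasks.py; Celery's AsyncResult lets the API look up a task's state and, once finished, the dict the task returned.
# In main.py - a minimal task-status endpoint for client polling.
# Assumes celery_app is importable from tasks.py.
from celery.result import AsyncResult
from .tasks import celery_app

@app.get("/api/v1/task-status/{task_id}")
async def get_task_status(task_id: str):
    result = AsyncResult(task_id, app=celery_app)
    if result.ready():
        # result.result holds the dict returned by classify_pdf_task
        return {"task_id": task_id, "status": "done", "result": result.result}
    return {"task_id": task_id, "status": result.state.lower()}  # e.g. "pending"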
Fine-Tuning Performance
Caching Strategies
FastAPI-Cache can cache responses for frequently accessed, rarely changing data.
# In main.py, for caching specific endpoints if needed
# from fastapi_cache import FastAPICache
# from fastapi_cache.backends.redis import RedisBackend
# from fastapi_cache.decorator import cache
# import aioredis
# @app.on_event("startup")
# async def startup_event():
# # Ensure Redis is running and accessible
# redis_url = "redis://localhost:6379"
# try:
# redis = await aioredis.from_url(redis_url, encoding="utf8", decode_responses=True)
# FastAPICache.init(RedisBackend(redis), prefix="fastapi-cache")
# logger.info(f"FastAPI-Cache initialized with Redis backend at {redis_url}")
# except Exception as e:
# logger.error(f"Could not connect to Redis for caching: {e}")
# @app.get("/some-expensive-readonly-data")
# @cache(expire=300) # Cache for 5 minutes
# async def get_expensive_data():
# # result = await some_very_long_database_query()
# # return result
# pass
Batch Processing for ML Inference
If you have many documents to classify, batching them for ML inference can be more efficient, especially if using GPUs.
# Conceptual BatchProcessor in ml_processor.py
# class DocumentClassifier:
# ...
# async def classify_batch(self, texts: list[str]) -> list[dict]:
# if not texts: return []
# try:
# text_features = self.vectorizer.transform(texts)
# predictions = self.model.predict(text_features)
# probabilities = self.model.predict_proba(text_features)
#
# results = []
# for i, pred_class in enumerate(predictions):
# confidence = float(np.max(probabilities[i]))
# results.append({"category": str(pred_class), "confidence": confidence})
# return results
# except Exception as e:
# raise MLProcessingError(f"Failed to classify batch: {e}")
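Calling the batch method is straightforward; a hypothetical usage sketch, assuming classify_batch has been added to DocumentClassifier as above:
# Hypothetical usage of classify_batch (assumes it was added above)
import asyncio
from ml_processor import DocumentClassifier

async def main():
    classifier = DocumentClassifier()
    texts = ["invoice total due 1250.00", "this agreement is entered into by ..."]
    for res in await classifier.classify_batch(texts):
        print(res["category"], f"{res['confidence']:.2f}")

asyncio.run(main())
Since scikit-learn's transform and predict calls are vectorized, a single batched call amortizes per-request overhead compared with looping over classify one document at a time.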
The Road Ahead: Future Enhancements
This system provides a solid foundation. Future improvements could include:
- More Sophisticated ML Models: Utilizing NLP transformers (like BERT) for deeper contextual understanding and layout-aware models for visually complex PDFs.
- Real-Time Progress & Results: Implementing WebSockets or Server-Sent Events for instant feedback to the React Native app instead of polling.
- Advanced Document Analysis: Extracting not just classifications but also key entities, summaries, or performing OCR on scanned documents.
- Enhanced Error Handling & Resilience: More granular error reporting and retry mechanisms.
- Scalability & Cost Optimization: Auto-scaling Celery workers, optimizing cloud resource usage.
- Security Hardening: Thorough input validation, authentication/authorization for APIs, secure storage of sensitive data.
Conclusion: A Powerful Partnership for Intelligent Apps
Combining React Native's frontend agility with Python's backend and ML capabilities creates a potent recipe for building truly intelligent mobile applications. This PDF classification system is just one example of how this partnership can automate complex tasks, provide valuable insights, and ultimately deliver a richer user experience. While the journey involves navigating different ecosystems, the synergy achieved makes it a worthwhile endeavor for any team looking to push the boundaries of mobile app intelligence.
What are your thoughts on integrating ML backends with mobile apps? Share your experiences or questions in the comments below!